Sparse Similarity
Difficulty:
Tags:
Problem Description
The similarity of two documents (each with distinct words) is defined to be the size of the intersection divided by the size of the union. For example, if the documents consist of integers, the similarity of {1, 5, 3} and {1, 7, 2, 3} is 0.4, because the intersection has size 2 and the union has size 5. We have a long list of documents (with distinct values and each with an associated ID) where the similarity is believed to be "sparse". That is, any two arbitrarily selected documents are very likely to have similarity 0. Design an algorithm that returns a list of pairs of document IDs and the associated similarity.
Input is a 2D array docs, where docs[i] is the document with id i. Return an array of strings, where each string represents a pair of documents with similarity greater than 0. The string should be formatted as {id1},{id2}: {similarity}, where id1 is the smaller id in the two documents, and similarity is the similarity rounded to four decimal places. You can return the array in any order.
Example:
Input:
[
[14, 15, 100, 9, 3],
[32, 1, 9, 3, 5],
[15, 29, 2, 6, 8, 7],
[7, 10]
]
Output:
[
"0,1: 0.2500",
"0,2: 0.1000",
"2,3: 0.1429"
]
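The expected output can be checked by hand from the definition. Because the words inside a document are distinct, the union size follows from the intersection size: |A ∪ B| = |A| + |B| - |A ∩ B|. The following is a minimal sketch of that check; the class name JaccardCheck and its method are illustrative and not part of the problem or the submitted solution.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper used only to verify the example above.
class JaccardCheck {

    // Similarity of two documents of distinct words: |A ∩ B| / |A ∪ B|.
    static double similarity(int[] a, int[] b) {
        Set<Integer> words = new HashSet<>();
        for (int w : a) {
            words.add(w);
        }
        int common = 0;
        for (int w : b) {
            if (words.contains(w)) {
                common++;
            }
        }
        // Valid only because each document holds distinct words (assumed from the problem).
        int union = a.length + b.length - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    public static void main(String[] args) {
        int[][] docs = {
            {14, 15, 100, 9, 3},
            {32, 1, 9, 3, 5},
            {15, 29, 2, 6, 8, 7},
            {7, 10}
        };
        // Documents 0 and 1 share {9, 3}: 2 / (5 + 5 - 2) = 0.2500
        System.out.println(String.format(Locale.ROOT, "0,1: %.4f", similarity(docs[0], docs[1])));
        // Documents 0 and 2 share {15}: 1 / (5 + 6 - 1) = 0.1000
        System.out.println(String.format(Locale.ROOT, "0,2: %.4f", similarity(docs[0], docs[2])));
        // Documents 2 and 3 share {7}: 1 / (6 + 2 - 1) = 0.1429
        System.out.println(String.format(Locale.ROOT, "2,3: %.4f", similarity(docs[2], docs[3])));
    }
}
```

All other pairs in the example share no word, so they are omitted from the output.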
Note:
- docs.length <= 500
- docs[i].length <= 500
- The number of document pairs with similarity greater than 0 will not exceed 1000.
Code Result
Runtime: 252 ms, Memory: 47.4 MB
/*
 * Problem Statement:
 * The intersection count of elements in two documents divided by the union count of elements in two documents represents their similarity.
 * For example, the similarity between {1, 5, 3} and {1, 7, 2, 3} is 0.4, where the intersection count is 2 and the union count is 5.
 * Given a series of documents, each represented by a unique integer array associated with an ID, design an algorithm to return the IDs of each pair of documents and their similarity.
 * Only output pairs with similarity greater than 0, ignoring empty documents.
 * For simplicity, assume each document is represented by an array of distinct integers.
 *
 * Input: A 2D array 'docs', where docs[i] represents the document with ID 'i'.
 * Output: An array of strings, each representing a pair of documents with similarity > 0 in the format '{id1},{id2}: {similarity}', where 'id1' is the smaller ID and 'similarity' is the similarity, precise to 4 decimal places.
 */
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DocumentSimilarityStream {

    public static List<String> calculateSimilarities(int[][] docs) {
        // Build each document's word set once, rather than once per pair.
        List<Set<Integer>> sets = Arrays.stream(docs)
                .map(doc -> Arrays.stream(doc).boxed().collect(Collectors.toSet()))
                .collect(Collectors.toList());

        // Compare every pair (i, j) with i < j and keep the pairs that share a word.
        return IntStream.range(0, docs.length)
                .boxed()
                .flatMap(i -> IntStream.range(i + 1, docs.length)
                        .mapToObj(j -> formatPair(i, j, sets.get(i), sets.get(j)))
                        .filter(Objects::nonNull))
                .collect(Collectors.toList());
    }

    // Returns "i,j: similarity", or null when the two documents share no word.
    private static String formatPair(int i, int j, Set<Integer> a, Set<Integer> b) {
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        if (intersection.isEmpty()) {
            return null;
        }
        Set<Integer> union = new HashSet<>(a);
        union.addAll(b);
        double similarity = (double) intersection.size() / union.size();
        // Locale.ROOT keeps '.' as the decimal separator regardless of the default locale.
        return String.format(Locale.ROOT, "%d,%d: %.4f", i, j, similarity);
    }

    public static void main(String[] args) {
        int[][] docs = {
            {14, 15, 100, 9, 3},
            {32, 1, 9, 3, 5},
            {15, 29, 2, 6, 8, 7},
            {7, 10}
        };
        List<String> similarities = calculateSimilarities(docs);
        similarities.forEach(System.out::println);
    }
}
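Explanation
Approach: Brute force over all pairs. For every pair of documents (i, j) with i < j, the code turns both documents into hash sets, intersects them, and, if the intersection is non-empty, divides its size by the size of the union and formats the result to four decimal places.
Time complexity: O(n^2 * m), where n is the number of documents and m is the maximum document length; each of the n(n - 1)/2 pairs costs O(m) expected time for the set operations.
Space complexity: O(m) per pair for the intersection and union sets, in addition to the hash sets built from the documents and the output list.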
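The problem statement says that similarity is sparse, which the pairwise comparison in the code above does not exploit: it still visits every pair. One common alternative, sketched here under the problem's assumption that words within a document are distinct, is an inverted index from each word to the documents containing it, so that only pairs sharing at least one word are ever touched. This is an illustrative sketch, not the submitted solution; the class name SparseSimilarityIndex and the pair-key encoding are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical alternative that only visits document pairs sharing a word.
class SparseSimilarityIndex {

    static List<String> computeSimilarities(int[][] docs) {
        int n = docs.length;
        // Inverted index: word -> ids of earlier documents that contain it.
        Map<Integer, List<Integer>> wordToDocs = new HashMap<>();
        // Pair key (i * n + j, with i < j) -> number of words the two documents share.
        Map<Long, Integer> shared = new HashMap<>();

        for (int j = 0; j < n; j++) {
            for (int word : docs[j]) {
                List<Integer> earlier = wordToDocs.computeIfAbsent(word, k -> new ArrayList<>());
                for (int i : earlier) {
                    shared.merge((long) i * n + j, 1, Integer::sum);
                }
                // Added after the loop above, so a document is never paired with itself.
                earlier.add(j);
            }
        }

        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, Integer> entry : shared.entrySet()) {
            int i = (int) (entry.getKey() / n);
            int j = (int) (entry.getKey() % n);
            int common = entry.getValue();
            // |A ∪ B| = |A| + |B| - |A ∩ B|, valid because words are distinct per document.
            int union = docs[i].length + docs[j].length - common;
            result.add(String.format(Locale.ROOT, "%d,%d: %.4f", i, j, (double) common / union));
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] docs = {
            {14, 15, 100, 9, 3},
            {32, 1, 9, 3, 5},
            {15, 29, 2, 6, 8, 7},
            {7, 10}
        };
        // The order of the printed pairs is unspecified, which the problem allows.
        computeSimilarities(docs).forEach(System.out::println);
    }
}
```

With this layout the work is proportional to the total number of words plus the number of (word, pair) co-occurrences, so documents that share nothing cost nothing beyond the index build. The union is derived from the two document lengths rather than materialized as a set.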