稀疏相似度

难度:

标签:

题目描述

The similarity of two documents (each with distinct words) is defined to be the size of the intersection divided by the size of the union. For example, if the documents consist of integers, the similarity of {1, 5, 3} and {1, 7, 2, 3} is 0.4, because the intersection has size 2 and the union has size 5. We have a long list of documents (with distinct values and each with an associated ID) where the similarity is believed to be "sparse". That is, any two arbitrarily selected documents are very likely to have similarity 0. Design an algorithm that returns a list of pairs of document IDs and the associated similarity.

Input is a 2D array docs, where docs[i] is the document with id i. Return an array of strings, where each string represents a pair of documents with similarity greater than 0. The string should be formatted as {id1},{id2}: {similarity}, where id1 is the smaller id in the two documents, and similarity is the similarity rounded to four decimal places. You can return the array in any order.

Example:

Input: 
[
  [14, 15, 100, 9, 3],
  [32, 1, 9, 3, 5],
  [15, 29, 2, 6, 8, 7],
  [7, 10]
]
Output:
[
  "0,1: 0.2500",
  "0,2: 0.1000",
  "2,3: 0.1429"
]

Note:

docs.length <= 500
docs[i].length <= 500
The number of document pairs with similarity greater than 0 will not exceed 1000.

代码结果

运行时间: 252 ms, 内存: 47.4 MB

/*
 * Problem Statement:
 * The intersection count of elements in two documents divided by the union count of elements in two documents represents their similarity. 
 * For example, the similarity between {1, 5, 3} and {1, 7, 2, 3} is 0.4, where the intersection count is 2 and the union count is 5. 
 * Given a series of documents, each represented by a unique integer array associated with an ID, design an algorithm to return the IDs of each pair of documents and their similarity.
 * Only output pairs with similarity greater than 0, ignoring empty documents. 
 * For simplicity, assume each document is represented by an array of distinct integers.
 *
 * Input: A 2D array 'docs', where docs[i] represents the document with ID 'i'. 
 * Output: An array of strings, each representing a pair of documents with similarity > 0 in the format '{id1},{id2}: {similarity}', where 'id1' is the smaller ID and 'similarity' is the similarity, precise to 4 decimal places.
 */
import java.util.*;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class DocumentSimilarityStream {
    public static List<String> calculateSimilarities(int[][] docs) {
        return IntStream.range(0, docs.length)
            .boxed()
            .flatMap(i -> IntStream.range(i + 1, docs.length)
                .mapToObj(j -> {
                    Set<Integer> set1 = Arrays.stream(docs[i]).boxed().collect(Collectors.toSet());
                    Set<Integer> set2 = Arrays.stream(docs[j]).boxed().collect(Collectors.toSet());

                    Set<Integer> intersection = new HashSet<>(set1);
                    intersection.retainAll(set2);

                    if (intersection.size() > 0) {
                        Set<Integer> union = new HashSet<>(set1);
                        union.addAll(set2);

                        double similarity = (double) intersection.size() / union.size();
                        return String.format("%d,%d: %.4f", i, j, similarity);
                    } else {
                        return null;
                    }
                })
                .filter(Objects::nonNull))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        int[][] docs = {
            {14, 15, 100, 9, 3},
            {32, 1, 9, 3, 5},
            {15, 29, 2, 6, 8, 7},
            {7, 10}
        };
        List<String> similarities = calculateSimilarities(docs);
        similarities.forEach(System.out::println);
    }
}

解释

方法:

本题解采用了倒排索引的思想。首先，统计每个文档的长度并存储在len_map中。然后，遍历每个文档中的每个词，利用cnt字典记录每个词出现在哪些文档中。在遍历的过程中，如果当前词在之前的文档中出现过，那么就更新sim_cnt字典，其中sim_cnt的键是文档对的元组(id1, id2)，值是这对文档的交集元素数量。最后，根据sim_cnt和len_map计算每对文档的相似度，并格式化输出。

时间复杂度:

O(d^2n)

空间复杂度:

O(dn + d^2)

代码细节讲解

🦆

在构建倒排索引时，你是如何决定将哪些信息存储在`cnt`字典中的？具体而言，为何选择记录每个词出现在哪些文档中而不是其他信息？

▷

在构建倒排索引时，选择将每个词出现在哪些文档中的信息存储在`cnt`字典中是为了快速定位和计算文档间的相似度。记录每个词出现在哪些文档中可以直接用来求解文档对之间的交集数目，这是计算Jaccard相似度的关键步骤。如果存储其他信息，比如词频或文档中的位置，虽然可以提供更多细节，但对于计算文档间的相似度并不直接有用，且会增加额外的处理复杂度和空间消耗。

🦆

在算法中，为什么要单独维护一个`len_map`来记录每个文档的长度，这一步是否可以通过其他方式简化或优化？

▷

在算法中，维护一个`len_map`来记录每个文档的长度是为了在计算相似度时快速获取文档的长度信息，这是计算Jaccard相似度公式中所需的分母部分。这一步可以通过在遍历文档时即时计算长度并存储来实现，但使用`len_map`可以使得代码更清晰和高效，因为每个文档的长度只需要计算一次即可重复使用。如果不使用`len_map`，每次计算相似度时都需要重新计算文档长度，将增加计算负担。

🦆

在计算相似度时，公式`val/(len_map[id1]+len_map[id2]-val)+1e-9`中添加`1e-9`的原因是什么？这种处理对结果的精度有何影响？

▷

在计算相似度时，添加`1e-9`是为了避免在Python中由于浮点数精度问题可能导致的除零错误或不稳定的数值表现。这个小的数值（称为epsilon）确保了分母不会为零，从而使得结果更稳定。对于结果的精度，这种处理通常影响极小，因为`1e-9`非常接近于零，其对最终相似度的影响通常低于实际精度需求，但可以在极端情况下防止程序错误。

稀疏相似度

题目描述

代码结果

解释

代码细节讲解

相关问题