I am implementing the SimHash algorithm [1] to deduplicate a dataset using MapReduce.
For example, suppose I have 4 documents: Doc1, Doc2, Doc3 and Doc4, and Doc1 is similar to Doc3 with a Hamming distance less than 3. Then after deduplication the output "unique" dataset should be Doc1, Doc2 and Doc4.
My implementation involves converting each document's hash into a 64-bit string and then partitioning this bit string into bands for further matching. For simplicity, let us say that:
Doc1 = band0+{101},band1+{110}
Doc2 = band0+{100},band1+{110}
Doc3 = band0+{101},band1+{110}
Doc4 = band0+{100},band1+{101}
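(The 3-bit bands above are just a toy illustration; with a real 64-bit SimHash the fingerprint would be cut into equal-width bands. A minimal sketch of how such band keys could be derived, assuming the band count evenly divides 64 — the key format `band<i>+<bits>` mirrors the notation above:)

```java
public class Banding {
    // Sketch: split a 64-bit SimHash into numBands equal-width bands,
    // each keyed by its band index so only same-position bands compare.
    // Assumes 2 <= numBands <= 64 and numBands divides 64.
    public static String[] bandKeys(long simhash, int numBands) {
        int width = 64 / numBands;
        long mask = (1L << width) - 1;
        String[] keys = new String[numBands];
        for (int i = 0; i < numBands; i++) {
            long band = (simhash >>> (i * width)) & mask;
            keys[i] = "band" + i + "+" + Long.toBinaryString(band);
        }
        return keys;
    }
}
```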
If I group the documents according to matching bands, the candidates for similarity will be:
1st set: Doc1, Doc3
2nd set: Doc2, Doc4
3rd set: Doc1, Doc2, Doc3
So now all I have to do is calculate the Hamming distance between each pair of candidates within a single set.
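(The pairwise check itself is cheap on the raw 64-bit fingerprints: XOR the two hashes and count the differing bits. A small sketch, using the threshold of 3 from the example above:)

```java
public class Hamming {
    // Hamming distance between two 64-bit SimHash fingerprints:
    // the number of differing bit positions, via XOR + popcount.
    public static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Near-duplicate test with the threshold used in the question (< 3).
    public static boolean nearDuplicate(long a, long b) {
        return distance(a, b) < 3;
    }
}
```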
I tried to implement the mapper as follows:
Input:
Key is LongWritable offset
Value is the document text
Output:
Key is the band#+the bit string
Value is the document text.
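(To make the key/value contract concrete, here is a plain-Java sketch that simulates the map-and-shuffle phase without Hadoop: each document is emitted once per band under a `band<i>+<bits>` key, and the grouping map stands in for the framework's shuffle. The precomputed hashes passed in are a stand-in for a real SimHash of the document text, which is assumed, not shown.)

```java
import java.util.*;

public class BandMapper {
    // Simulates map + shuffle: emit (bandKey, docId) per band, then group
    // values by key the way the MapReduce framework would.
    // Assumes 2 <= numBands <= 64 and numBands divides 64.
    public static Map<String, List<String>> group(Map<String, Long> docHashes,
                                                  int numBands) {
        Map<String, List<String>> groups = new TreeMap<>();
        int width = 64 / numBands;
        long mask = (1L << width) - 1;
        for (Map.Entry<String, Long> e : docHashes.entrySet()) {
            for (int i = 0; i < numBands; i++) {
                long band = (e.getValue() >>> (i * width)) & mask;
                String key = "band" + i + "+" + Long.toBinaryString(band);
                groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return groups;
    }
}
```

Every group with more than one document is a candidate set for the pairwise Hamming-distance check.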
But now I am confused about the reducer. I don't want conflicts in the final dataset, but what guarantees that? And what should the reducer's output key and value be?
Update (more explanation): Suppose the reducer input key is the band# + the bit string and the value is the list of documents sharing that band. For example:
Band0+{101} = Doc1,Doc3
The Hamming distance could be calculated to identify the duplicate documents. But groups (sets) may share one or more documents, and there is no guarantee that the same documents will end up in the same reducer.
For example, if the first group is Doc1, Doc2, Doc3 and the second group is Doc2, Doc3, Doc4, and Doc2 and Doc3 are duplicates, how can I output the unique documents as Doc1, Doc3 and Doc4?
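(One common way to resolve such cross-group conflicts, offered here only as a sketch and not something from the question itself, is a second pass: each reducer emits confirmed duplicate *pairs*, and a final step merges those pairs with union-find, keeping one representative per connected component. On the example above, merging the pair (Doc2, Doc3) yields exactly Doc1, Doc3 and Doc4:)

```java
import java.util.*;

public class Dedup {
    // Union-find over document ids: duplicate pairs found in any group
    // are merged; one representative per component is kept.
    private final Map<String, String> parent = new HashMap<>();

    private String find(String x) {
        parent.putIfAbsent(x, x);
        while (!parent.get(x).equals(x)) {
            parent.put(x, parent.get(parent.get(x))); // path halving
            x = parent.get(x);
        }
        return x;
    }

    public void union(String a, String b) {
        parent.put(find(a), find(b));
    }

    // One surviving document per duplicate-component.
    public Set<String> representatives(Collection<String> allDocs) {
        Set<String> reps = new TreeSet<>();
        for (String d : allDocs) reps.add(find(d));
        return reps;
    }
}
```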
I came across these questions but they were not of much help to me:
[1] M. S. Charikar, "Similarity Estimation Techniques from Rounding Algorithms," Proc. 34th Annual ACM Symposium on Theory of Computing (STOC), 2002, pp. 380-388.