
I am implementing the SimHash algorithm [1] to deduplicate a dataset using MapReduce.

For example, suppose I have 4 documents: Doc1, Doc2, Doc3 and Doc4, and suppose Doc1 is similar to Doc3 with a Hamming distance less than 3. Then after deduplication, the output "unique" dataset should be Doc1, Doc2 and Doc4.

My implementation involves converting each document's hash into a 64-bit string and then partitioning that bit string into bands for further matching. For simplicity, let us say that:

Doc1 = band0+{101},band1+{110}
Doc2 = band0+{100},band1+{110}
Doc3 = band0+{101},band1+{110}
Doc4 = band0+{100},band1+{101}

If I group the documents according to matching bands, then the candidate sets for similarity will be:

1st set: Doc1, Doc3
2nd set: Doc2, Doc4
3rd set: Doc1, Doc2, Doc3

So now all I have to do is calculate the Hamming distance between each pair of candidates within a single set.
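This step can be sketched in plain Java, independent of the MapReduce plumbing. The band strings come from the toy example above; `docBands` (a name introduced here for illustration) maps each doc ID to its array of band bit strings:

```java
import java.util.*;

public class Candidates {
    // Hamming distance between two 64-bit SimHash values:
    // the number of bit positions in which they differ.
    static int hamming(long a, long b) { return Long.bitCount(a ^ b); }

    // Group documents by identical band values; every group with more
    // than one member is a candidate set whose pairs then get a full
    // Hamming-distance check.
    static List<Set<String>> candidateSets(Map<String, String[]> docBands) {
        Map<String, Set<String>> byBand = new TreeMap<>();
        for (Map.Entry<String, String[]> e : docBands.entrySet())
            for (int i = 0; i < e.getValue().length; i++)
                byBand.computeIfAbsent("band" + i + "+{" + e.getValue()[i] + "}",
                        k -> new TreeSet<>()).add(e.getKey());
        List<Set<String>> sets = new ArrayList<>();
        for (Set<String> s : byBand.values())
            if (s.size() > 1) sets.add(s);
        return sets;
    }
}
```

Running this on the four documents above yields exactly the three candidate sets listed: {Doc1, Doc3}, {Doc2, Doc4} and {Doc1, Doc2, Doc3}.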

I tried to implement the mapper in which:

Input:
  Key: the LongWritable offset
  Value: the document text

Output:
  Key: the band# + the bit string
  Value: the document text
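The emit logic of that mapper can be sketched outside the Hadoop API (a real implementation would subclass Mapper and call context.write). The 64-bit SimHash is assumed to be computed already, and the split into 4 bands of 16 bits is an illustrative assumption, not part of the original question:

```java
import java.util.*;

public class SimHashMapper {
    // Sketch of the map step for one document: emit one
    // (band key, document text) pair per band.
    static List<String[]> mapOne(long simHash, String docText) {
        List<String[]> out = new ArrayList<>();
        for (int band = 0; band < 4; band++) {
            // Take the band-th 16-bit slice of the hash, zero-padded.
            long bits = (simHash >>> (band * 16)) & 0xFFFFL;
            String key = "band" + band + "+{"
                    + String.format("%16s", Long.toBinaryString(bits)).replace(' ', '0')
                    + "}";
            out.add(new String[]{key, docText});
        }
        return out;
    }
}
```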

But now I am confused about the reducer. I don't want conflicts in the final dataset, but what guarantees that? What should the reducer's output key and value be?


Update (more explanation): Suppose the reducer's input key is the band# + the bit string, and the value is the list of documents sharing that band. For example:

Band0+{101} = Doc1,Doc3

The Hamming distance can then be calculated to find the duplicate documents. But the groups (sets) might conflict over one or more documents, as there is no guarantee that the same documents will end up in the same reducer.

For example, suppose the first group is Doc1, Doc2, Doc3 and the second group is Doc2, Doc3, Doc4, and Doc2 and Doc3 are duplicates. How can I output the unique documents as Doc1, Doc3 and Doc4?
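One sketch of how two chained jobs could resolve this deterministically (plain Java standing in for the MapReduce plumbing; the IDs, hashes and "keep the smallest ID" tie-break are illustrative choices, not the only design): the first job flags, within each band group, every document that has a near-duplicate with a lexicographically smaller ID; the second job drops all flagged IDs. Because the tie-break is applied identically in every group, two groups that both see the same duplicate pair flag the same victim, so the union of flags is consistent. (Transitive chains of near-duplicates can still over-drop; a fully general solution needs a connected-components pass.)

```java
import java.util.*;

public class DedupLogic {
    static int hamming(long a, long b) { return Long.bitCount(a ^ b); }

    // Job 1, per reducer group: flag each doc that has a near-duplicate
    // with a lexicographically smaller ID in the same group.
    static Set<String> flagDuplicates(Map<String, Long> group, int threshold) {
        Set<String> flagged = new TreeSet<>();
        for (Map.Entry<String, Long> a : group.entrySet())
            for (Map.Entry<String, Long> b : group.entrySet())
                if (a.getKey().compareTo(b.getKey()) < 0
                        && hamming(a.getValue(), b.getValue()) < threshold)
                    flagged.add(b.getKey());   // drop the larger ID
        return flagged;
    }

    // Job 2: keep every document not flagged by any group.
    static Set<String> unique(Set<String> allDocs, List<Set<String>> flagsPerGroup) {
        Set<String> keep = new TreeSet<>(allDocs);
        for (Set<String> f : flagsPerGroup) keep.removeAll(f);
        return keep;
    }
}
```

With the example above this keeps Doc1, Doc2 and Doc4 (the smaller ID of the duplicate pair survives); keeping Doc3 instead would just mean flipping the tie-break.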

I came across these questions but they were not of much help to me:

  1. Deciding key value pair for deduplication using hadoop mapreduce
  2. How to implement LSH by MapReduce?

[1] M. S. Charikar, "Similarity Estimation Techniques from Rounding Algorithms," Proc. 34th Annual ACM Symposium on Theory of Computing (STOC), 2002, pp. 380-388.


1 Answer


For each document you can emit zero or more outputs, so you can do the following:

Input1: Doc1
Outputs
  key1: band0101, value1: Doc1
  key2: band1110, value2: Doc1

(one output for each band)

Input2: Doc2
Outputs
  key1: band0100, value1: Doc2
  key2: band1110, value2: Doc2
.
.
.

With this approach, in the reducers you will get all the docs with the key band0101 grouped together, and likewise for band0100, band1110, etc.
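The grouping the answer describes (which the framework's shuffle/sort phase performs for you) can be simulated in plain Java on the example output above:

```java
import java.util.*;

public class Shuffle {
    // Simulate the shuffle/sort phase: every value emitted under the
    // same key arrives at a single reducer call as one grouped list.
    static Map<String, List<String>> group(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs)
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        return grouped;
    }
}
```

Feeding in the mapper outputs shown above, the key band1110 ends up with the list [Doc1, Doc2], which is exactly what the reducer receives.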

RojoSam
  • Thanks for your answer. You are right, and then I can measure the Hamming distance within each group to find the duplicate documents. But the groups might conflict over one or more documents, as there is no guarantee that the same documents will end up in the same reducer. For example, if the first group is Doc1, Doc2, Doc3 and the second group is Doc2, Doc3, Doc4, and Doc2 and Doc3 are duplicates, how can I output the unique documents as Doc1, Doc3 and Doc4? – Daisy Sep 07 '15 at 06:23
  • One option is to have more than one MapReduce job: in the first job you identify the duplicated documents, and in the second you filter the duplicates out of the output. The challenge is to always keep the same document and drop the others. – RojoSam Sep 07 '15 at 16:55
  • In your example: _OutputGroup1_ -> (Doc1ID, Doc1), (Doc2ID, ), (Doc3ID, Doc3) __ OutputGroup2 -> (Doc2ID, – RojoSam Sep 07 '15 at 17:05