I have started working with a large dataset that arrives in JSON format. Unfortunately, the service providing the data feed delivers a non-trivial number of duplicate records. On the upside, each record has a unique Id stored as a 64-bit positive integer (a Java long).
The data arrives once a week, with about 10M records per delivery. I need to exclude duplicates both within the current delivery and against records that appeared in previous batches.
The brute-force approach to the de-dup problem is to push the Id numbers into a Java Set. Since the Set interface enforces uniqueness, add() returning false during the insert indicates a duplicate.
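A minimal sketch of what I mean (class and method names are just placeholders; a real version would also have to worry about the memory cost of tens of millions of boxed Longs accumulating over weeks):

```java
import java.util.HashSet;
import java.util.Set;

public class DedupSketch {
    private final Set<Long> seenIds = new HashSet<>();

    /**
     * Returns true the first time an Id is seen, false for duplicates.
     * Set.add() already reports this, so no exception handling is needed.
     */
    public boolean isNewId(long id) {
        return seenIds.add(id);
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.isNewId(42L)); // true  - first occurrence
        System.out.println(dedup.isNewId(42L)); // false - duplicate
    }
}
```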
The question is: Is there a better way to look for a duplicate long as I import records?
I am using Hadoop to mine the data, so if there is a good way to use Hadoop to de-dup the records, that would be a bonus; a rough sketch of what I'm picturing is below.
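The naive MapReduce version I can imagine is to key every record on its Id and keep one record per key in the reducer (feeding both the new delivery and the previously accepted records in as inputs). This is only a sketch of that idea; extractId() is a placeholder for whatever JSON parsing is actually used, and the class names are made up:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupJob {

    // Emits (Id, record) so that all copies of a record meet at the same reducer.
    public static class IdMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private final LongWritable outKey = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text jsonLine, Context context)
                throws IOException, InterruptedException {
            outKey.set(extractId(jsonLine.toString()));
            context.write(outKey, jsonLine);
        }

        // Placeholder: a real implementation would use a proper JSON parser.
        private long extractId(String json) {
            return Long.parseLong(json.replaceAll(".*\"Id\"\\s*:\\s*(\\d+).*", "$1"));
        }
    }

    // All records sharing an Id arrive together; keep just the first one.
    public static class FirstRecordReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable id, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            context.write(id, records.iterator().next());
        }
    }
}
```

Is that roughly the right shape for this problem, or is there a smarter way to avoid reshuffling the full history every week?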