I have started working with a large dataset that arrives in JSON format. Unfortunately, the service providing the data feed delivers a non-trivial number of duplicate records. On the upside, each record has a unique Id stored as a 64-bit positive integer (a Java long).
The data arrives once a week, with about 10M records per delivery. I need to exclude duplicates both within the current delivery and against records that appeared in previous batches.
The brute-force approach to the de-dup problem is to push the Id numbers into a Java Set. Since the Set interface enforces uniqueness, add() returning false during the insert indicates a duplicate.
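A minimal sketch of what I mean (class and method names are just placeholders; a real version would also have to worry about the memory cost of tens of millions of boxed Longs accumulating over weeks):

```java
import java.util.HashSet;
import java.util.Set;

public class DedupSketch {
    private final Set<Long> seenIds = new HashSet<>();

    /**
     * Returns true the first time an Id is seen, false for duplicates.
     * Set.add() already reports this, so no exception handling is needed.
     */
    public boolean isNewId(long id) {
        return seenIds.add(id);
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.isNewId(42L)); // true  - first occurrence
        System.out.println(dedup.isNewId(42L)); // false - duplicate
    }
}
```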
The question is: Is there a better way to look for a duplicate long as I import records?
I am using Hadoop to mine the data, so if there is a good way to use Hadoop to de-dup the records, that would be a bonus; a rough sketch of what I'm picturing is below.
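The naive MapReduce version I can imagine is to key every record on its Id and keep one record per key in the reducer (feeding both the new delivery and the previously accepted records in as inputs). This is only a sketch of that idea; extractId() is a placeholder for whatever JSON parsing is actually used, and the class names are made up:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupJob {

    // Emits (Id, record) so that all copies of a record meet at the same reducer.
    public static class IdMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        private final LongWritable outKey = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text jsonLine, Context context)
                throws IOException, InterruptedException {
            outKey.set(extractId(jsonLine.toString()));
            context.write(outKey, jsonLine);
        }

        // Placeholder: a real implementation would use a proper JSON parser.
        private long extractId(String json) {
            return Long.parseLong(json.replaceAll(".*\"Id\"\\s*:\\s*(\\d+).*", "$1"));
        }
    }

    // All records sharing an Id arrive together; keep just the first one.
    public static class FirstRecordReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable id, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            context.write(id, records.iterator().next());
        }
    }
}
```

Is that roughly the right shape for this problem, or is there a smarter way to avoid reshuffling the full history every week?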