Let's say you have a large text file. Each row contains an email-id and some other information (say a product-id). Assume there are millions of rows in the file. You have to load this data into a database. How would you efficiently de-dup the data (i.e. eliminate duplicates)?
-
what are your memory constraints? – Edward Q. Bridges Feb 25 '10 at 05:35
-
let's say it's 2GB (32-bit machine). – mnquasar Feb 25 '10 at 05:48
-
What do you want to do with the other information on each row? Are you trying to normalize the data into a master-detail table structure, or do you just want one record in a single table for each unique email-id? If you want to do the latter does it matter which of the rows with each unique email-id are put in the database? – ChrisH Feb 25 '10 at 06:36
-
@ChrisH: I guess what you are asking is whether (email-id, prod-id1) and (email-id, prod-id2) count as a duplicate? I am actually interested in both cases. Case I: where (email-id, prod-id1) and (email-id, prod-id2) are a duplicate, i.e. dups are checked only by email-id, and Case II: where this is not a duplicate, i.e. a row is a duplicate only when both email-id and prod-id are the same. – mnquasar Feb 25 '10 at 06:47
6 Answers
Insane number of rows
- Use a MapReduce framework (e.g. Hadoop). This is full-blown distributed computing, so it's overkill unless you have TBs of data though. ( j/k :) )
Unable to fit all rows in memory
- Even the result won't fit: use merge sort, persisting intermediate data to disk. As you merge, you can discard duplicates (probably this sample helps; a rough sketch of the merge also follows below this list). This can be multi-threaded if you want.
- The results will fit: instead of reading everything into memory and then putting it in a HashSet (see below), you can use a line iterator or something and keep adding to the HashSet as you go. You can use a ConcurrentHashMap and more than one thread to read the file and add to this Map. Another multi-threaded option is a ConcurrentSkipListSet. In that case you implement compareTo() instead of equals()/hashCode() (compareTo() == 0 means duplicate) and keep adding to this SortedSet.
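A rough sketch of the merge-sort route above: it spills sorted chunks to temp files and then does a k-way merge, writing each line only once. The chunk size, the class name, and the assumption that a duplicate means an identical whole line (Case II, when a line is just email-id,prod-id) are all illustrative, not part of the original answer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSortDedup {

    static final int CHUNK_LINES = 1_000_000; // tune to the available heap (illustrative value)

    public static void main(String[] args) throws IOException {
        List<Path> chunks = sortChunks(Paths.get(args[0]));
        mergeDiscardingDuplicates(chunks, Paths.get(args[1]));
    }

    // Phase 1: read fixed-size chunks, sort each in memory, spill to temp files.
    static List<Path> sortChunks(Path input) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>(CHUNK_LINES);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == CHUNK_LINES) {
                    chunks.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(spill(buffer));
            }
        }
        return chunks;
    }

    static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path tmp = Files.createTempFile("dedup-chunk", ".txt");
        Files.write(tmp, buffer);
        return tmp;
    }

    // Phase 2: k-way merge of the sorted chunks; duplicates come out adjacent
    // and are written only once. Duplicate here = identical whole line.
    static void mergeDiscardingDuplicates(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<String[]> heap =
                new PriorityQueue<>(Comparator.comparing((String[] e) -> e[0]));
        List<BufferedReader> readers = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            BufferedReader r = Files.newBufferedReader(chunks.get(i));
            readers.add(r);
            String first = r.readLine();
            if (first != null) {
                heap.add(new String[] { first, Integer.toString(i) });
            }
        }
        String last = null;
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                String[] entry = heap.poll();
                if (!entry[0].equals(last)) {   // skip duplicates of the last written line
                    out.write(entry[0]);
                    out.newLine();
                    last = entry[0];
                }
                String next = readers.get(Integer.parseInt(entry[1])).readLine();
                if (next != null) {
                    heap.add(new String[] { next, entry[1] });
                }
            }
        }
        for (BufferedReader r : readers) {
            r.close();
        }
    }
}
```

For Case I you would sort and compare on the email-id prefix of each line instead of the whole line.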
Fits in memory
- Design an object that holds your data, implement good equals()/hashCode() methods, and put them all in a HashSet (a sketch follows below).
- Or use the methods given above (you probably don't want to persist to disk though).
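A minimal sketch of the "fits in memory" variant, assuming comma-separated lines with the email-id first; the Row class and field names are made up for illustration, and equals()/hashCode() here define a duplicate as the (email-id, prod-id) pair (Case II).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Value object whose equals()/hashCode() cover exactly the fields that define a duplicate.
class Row {
    final String emailId;
    final String prodId;

    Row(String emailId, String prodId) {
        this.emailId = emailId;
        this.prodId = prodId;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Row)) return false;
        Row other = (Row) o;
        return emailId.equals(other.emailId) && prodId.equals(other.prodId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(emailId, prodId);
    }
}

public class InMemoryDedup {
    public static void main(String[] args) throws IOException {
        Set<Row> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 3);       // assumed format: email-id,prod-id,...
                if (seen.add(new Row(parts[0], parts[1]))) {
                    // first occurrence: hand the row to the DB loader here
                }
            }
        }
        System.out.println("Unique rows: " + seen.size());
    }
}
```

For Case I, drop prodId from equals()/hashCode(), or simply keep a Set&lt;String&gt; of email-ids.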
Oh, and if I were you, I would put a unique constraint on the DB anyway...

I will start with the obvious answer. Make a HashMap and put the email-id in as the key and the rest of the information into the value (or make an object to hold all the information). When you get to a new line, check to see if the key exists; if it does, move to the next line. At the end, write out all your SQL statements using the HashMap. I do agree with eqbridges that memory constraints will be important if you have a "gazillion" rows.
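A short sketch of that HashMap approach, assuming the email-id is the first comma-separated field and dedup is by email-id only (Case I); the table and column names in the generated SQL are invented for illustration (in real code you would use a PreparedStatement rather than string concatenation).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class HashMapDedup {
    public static void main(String[] args) throws IOException {
        Map<String, String> byEmail = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);     // assumed: email-id, then everything else
                byEmail.putIfAbsent(parts[0], parts[1]); // keep the first row seen for each email-id
            }
        }
        // At the end, emit one INSERT per unique email-id (hypothetical table/columns).
        for (Map.Entry<String, String> e : byEmail.entrySet()) {
            System.out.printf("INSERT INTO contacts (email_id, info) VALUES ('%s', '%s');%n",
                    e.getKey(), e.getValue());
        }
    }
}
```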

-
I would prefer to handle exact duplicates like this on the database side. Set UNIQUE constraints on columns that should be unique. Run INSERTs as often as you want, and duplicates will fail. Furthermore, you can query your accepted data for near duplicates, and UPDATE it as necessary. The memory requirements would be relatively small. – Dolph Feb 25 '10 at 05:43
-
I do like the DB answer better. I agree that it would probably be slower to process the rows initially, but overall I bet it would come out about even. You could do all the insert statements in a transaction and then just roll back if it fails. It would certainly give more flexibility as well and avoid the memory issues. Dolph, you should have left it as an answer so I could vote it up. – TheSteve0 Feb 25 '10 at 05:51
-
The only problem I see with a single transaction is you wouldn't be eliminating duplicates but aborting the whole batch if you encountered one. – PSpeed Feb 25 '10 at 06:30
-
I am sorry - I was not clear in my comment at all. I meant that each insert gets its own transaction. So you would be doing millions of transactions - NOT one big one. – TheSteve0 Feb 25 '10 at 06:32
-
Thanks for the answers, folks. But I don't think the problem is that simple. The first question is: will it be possible to keep all the data in memory (the HashSet)? Remember there are millions of rows. Even before that, can you load the file in memory? I guess the whole file does not need to be in memory; the reader would do buffered reads of a few blocks (lines) at a time, so we might be covered there. But is there a way to make the HashSet/Map-based approach more efficient? – mnquasar Feb 25 '10 at 06:32
You have two options:
- Do it in Java: you could put together something like a HashSet for testing, adding an email-id for each item that comes in if it doesn't exist in the set.
- Do it in the database: put a unique constraint on the table, such that dups will not be added to the table. An added bonus to this is that you can repeat the process and remove dups from previous runs. (A sketch of this follows below.)
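A hedged JDBC sketch of the database option (which also covers the per-row handling discussed in the comments below): the connection URL, table, and column names are assumptions. A duplicate insert raises an SQLException whose SQLState is in class 23 (integrity constraint violation); catching that and skipping the row keeps the load running instead of aborting it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ConstraintDedupLoader {
    public static void main(String[] args) throws IOException, SQLException {
        // Hypothetical connection URL and schema:
        //   CREATE TABLE contacts (email_id VARCHAR(255), prod_id VARCHAR(64),
        //                          CONSTRAINT uq_contact UNIQUE (email_id, prod_id));
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/mydb", "user", "pass");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO contacts (email_id, prod_id) VALUES (?, ?)");
             BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]))) {

            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 3);
                insert.setString(1, parts[0]);
                insert.setString(2, parts[1]);
                try {
                    insert.executeUpdate();   // auto-commit: each insert is its own transaction
                } catch (SQLException e) {
                    // SQLState class 23 = integrity constraint violation: a duplicate, so skip it.
                    if (e.getSQLState() == null || !e.getSQLState().startsWith("23")) {
                        throw e;              // anything else is a real error
                    }
                }
            }
        }
    }
}
```

For Case I the unique constraint would go on email_id alone. For millions of rows, batching or a dialect-specific variant (PostgreSQL's INSERT ... ON CONFLICT DO NOTHING, MySQL's INSERT IGNORE) will be much faster, but the skeleton is the same.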

-
for #2, the load could be interrupted by unique constraint violations. Depending on how big the set is and how many dupes, this could unnecessarily increase load time. – Edward Q. Bridges Feb 25 '10 at 05:46
-
Yes, this is a problem with the unique constraint approach... so although I like the idea... you would have to catch the unique constraint exception in order to avoid interruptions. – mnquasar Feb 25 '10 at 06:37
Take a look at Duke (https://github.com/larsga/Duke), a fast dedupe and record-linkage engine written in Java. It uses Lucene to index the data and reduce the number of comparisons (to avoid the unacceptable Cartesian-product comparison). It supports the most common algorithms (edit distance, Jaro-Winkler, etc.) and is extremely extensible and configurable.

Can you not index the table by email and product ID? Then reading by index should make duplicates of either email or email+prodId readily identifiable via sequential reads, simply matching against the previous record.
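A hedged sketch of that idea: read the staged rows back in index order and compare each row with the previous one, so duplicates show up as adjacent equal keys. The staging table, column names, and connection details are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class IndexScanDedup {
    public static void main(String[] args) throws SQLException {
        // Assumes the raw rows were bulk-loaded into a staging table with an index on
        // (email_id, prod_id), so this ORDER BY is a cheap sequential index read.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/mydb", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT email_id, prod_id FROM staging ORDER BY email_id, prod_id")) {

            String prevEmail = null;
            String prevProd = null;
            while (rs.next()) {
                String email = rs.getString(1);
                String prod = rs.getString(2);
                // Adjacent rows with the same key are duplicates (Case II);
                // for Case I compare the email alone.
                boolean duplicate = email.equals(prevEmail) && prod.equals(prevProd);
                if (!duplicate) {
                    // first occurrence: copy this row into the clean target table here
                }
                prevEmail = email;
                prevProd = prod;
            }
        }
    }
}
```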

Your problem can be solved with an Extract, Transform, Load (ETL) approach:
- You load your data in an import schema;
- Do every transformation you like on the data;
- Then load it into the target database schema.
You can do this manually or use an ETL tool.
