
I have ~25,000 distinct names in an SQL database, and I would like to perform edit-distance comparisons on all of them in order to normalize pairs such as John Doe and Jhon Doe.

When the database held only around 1,000 names, I stored all distinct names in an array and used two nested for-loops to compare each element against every other element. When the edit distance gave a match of, say, >0.9, I would execute an SQL query substituting one value for the other in all records.

With my much larger database this is not possible anymore. What would you guys do?

ps: I'm also curious about any multithreaded solutions to this because the process is taking ages now.

pps: I'm coding in Java

Freek8
  • are the names in the same table? which function do you use to compare the names? – carpamon Jan 25 '12 at 22:26
  • Can this be possible on DB side? If so, I prefer that. Otherwise may be something like fork/join concept may be useful. – kosa Jan 25 '12 at 22:27
  • It is basically one large array of names that is compared against itself. I don't think this is possible on DB side because I am calculating a metric over each combination of two names to see if they are similar (to correct misspellings etc) – Freek8 Jan 25 '12 at 22:29

2 Answers


There is no way around pairwise matching: it is as efficient as it gets.

If you need to do your record linkage faster, try using a string distance metric that requires less computation than the edit distance (the Jaro–Winkler distance, for example).

You could also use another metric as a preprocessing step, and then compute edit distance to confirm or deny the match.
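A minimal sketch of that two-stage idea (the class and method names, and the length-difference cutoff, are my own assumptions, not from the answer): reject pairs with a cheap length check first, and only run the full dynamic-programming edit distance on the survivors.

```java
public class TwoStageMatch {

    // Cheap prefilter: strings whose lengths differ by more than maxDist
    // cannot be within maxDist edits of each other, so skip the DP entirely.
    static boolean mightMatch(String a, String b, int maxDist) {
        return Math.abs(a.length() - b.length()) <= maxDist;
    }

    // Classic two-row dynamic-programming Levenshtein distance,
    // run only on pairs that survive the prefilter.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The prefilter is O(1) per pair, so the quadratic number of pair checks mostly touches cheap comparisons, and the O(n·m) DP only runs on plausible candidates.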

Sergey Kalinichenko
  • I am using Jaro-Winkler which is an edit distance ;) The problem is not so much in the time the computation of the jw-distance takes, but in the number of records. I'm looking for a way to split up the work in different threads for example – Freek8 Jan 25 '12 at 22:42
  • @Freek8 Oops, I thought [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) was called the "edit distance" (but Wikipedia says that I was wrong, there are many ["edit distances"](http://en.wikipedia.org/wiki/Edit_distance) out there). Anyway, you have N*(N-1)/2 checks to make regardless of your metric; the only thing that could make your process go faster is calculating the metric much faster. – Sergey Kalinichenko Jan 25 '12 at 22:50
  • what do you think of the soundex approach by ChrisJ? I could group all the records by soundex and only do JW-distance per group. Perhaps I could give each group a separate thread! – Freek8 Jan 25 '12 at 22:53
  • @Freek8 I would give it a try and see how it goes. You may get fewer matches than with your regular edit distance, but I am not sure if you have much choice there. – Sergey Kalinichenko Jan 25 '12 at 23:26
  • thanks! With my old approach I get out-of-memory problems, or it takes ages. I'm going to try this. – Freek8 Jan 25 '12 at 23:30
  • @Freek8 You should accept ChrisJ's answer then, because he was the one to bring up Soundex. Good luck! – Sergey Kalinichenko Jan 25 '12 at 23:32

What about computing the soundex of each of your names and possibly storing it in the database? You can even do that on DB side, for instance there's a MySQL SOUNDEX function.

After computing the soundex of each name, all you have to do is group the rows by identical soundex.

EDIT:

If soundex is too coarse for your application, you can first select candidates by comparing their soundexes, and use your usual metric on each set of candidates.
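A sketch of that candidate-selection (blocking) step in Java, using a simplified American Soundex written here for illustration (Apache Commons Codec also ships a `Soundex` class, and MySQL's `SOUNDEX()` can do the same grouping on the DB side, as the answer notes). The class name and grouping helper are my own; this simplified coding may differ from MySQL's in edge cases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SoundexBlocking {

    // Simplified American Soundex: first letter plus up to three digit codes.
    // Vowels reset the previous code; H and W do not separate equal codes.
    static String soundex(String name) {
        String s = name.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        final String codes = "01230120022455012623010202"; // codes for A..Z
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = codes.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = codes.charAt(s.charAt(i) - 'A');
            if (c != '0' && c != prev) out.append(c);
            if (s.charAt(i) != 'H' && s.charAt(i) != 'W') prev = c;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Group names by Soundex code: the expensive pairwise metric then only
    // runs inside each (much smaller) group. The groups are independent,
    // so they can also be handed to separate threads or a parallel stream.
    static Map<String, List<String>> groupBySoundex(List<String> names) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String n : names)
            groups.computeIfAbsent(soundex(n), k -> new ArrayList<>()).add(n);
        return groups;
    }
}
```

Since the groups share no state, this also answers the multithreading question: iterating `groups.values()` with a parallel stream or a fixed thread pool parallelizes the per-group Jaro–Winkler comparisons with no locking.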

ChrisJ
  • I don't really fancy Soundex, but I like your idea of using it to select candidates, especially since there is a MySQL SOUNDEX function available. – Freek8 Jan 25 '12 at 22:43