
I have ~25,000 distinct names in an SQL database, and I would like to perform edit-distance comparisons on all of them in order to normalize pairs such as John Doe and Jhon Doe.

When the database held only around 1,000 names, I stored all distinct names in an array and used two nested for-loops to compare each element against every other element. When the edit distance gave a match of, say, >0.9, I would execute an SQL query substituting one value for the other in all records.

With my much larger database this is not possible anymore. What would you guys do?

ps: I'm also curious about any multithreaded solutions to this because the process is taking ages now.

pps: I'm coding in Java

Freek8
  • are the names in the same table? which function do you use to compare the names? – carpamon Jan 25 '12 at 22:26
  • Can this be possible on DB side? If so, I prefer that. Otherwise may be something like fork/join concept may be useful. – kosa Jan 25 '12 at 22:27
  • It is basically one large array of names that is compared against itself. I don't think this is possible on DB side because I am calculating a metric over each combination of two names to see if they are similar (to correct misspellings etc) – Freek8 Jan 25 '12 at 22:29

2 Answers


There is no way around pairwise matching: it is as efficient as it gets.

If you need to do your record linkage faster, try using a string distance metric that requires less computation than the edit distance (the Jaro–Winkler distance, for example).

You could also use another metric as a preprocessing step, and then compute edit distance to confirm or deny the match.
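A minimal sketch of that two-stage idea (the class and method names, and the length-difference cutoff, are my own assumptions, not from the answer): reject pairs with a cheap length check first, and only run the full dynamic-programming edit distance on the survivors.

```java
public class TwoStageMatch {

    // Cheap prefilter: strings whose lengths differ by more than maxDist
    // cannot be within maxDist edits of each other, so skip the DP entirely.
    static boolean mightMatch(String a, String b, int maxDist) {
        return Math.abs(a.length() - b.length()) <= maxDist;
    }

    // Classic two-row dynamic-programming Levenshtein distance,
    // run only on pairs that survive the prefilter.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

The prefilter is O(1) per pair, so the quadratic number of pair checks mostly touches cheap comparisons, and the O(n·m) DP only runs on plausible candidates.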

Sergey Kalinichenko
  • I am using Jaro-Winkler which is an edit distance ;) The problem is not so much in the time the computation of the jw-distance takes, but in the number of records. I'm looking for a way to split up the work in different threads for example – Freek8 Jan 25 '12 at 22:42
  • @Freek8 Oops, I thought [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) was called the "edit distance" (but Wikipedia says that I was wrong, there are many ["edit distances"](http://en.wikipedia.org/wiki/Edit_distance) out there). Anyway, you have N*(N-1)/2 checks to make regardless of your metric; the only thing that could make your process go faster is calculating the metric much faster. – Sergey Kalinichenko Jan 25 '12 at 22:50
  • what do you think of the soundex approach by ChrisJ? I could group all the records by soundex and only do JW-distance per group. Perhaps I could give each group a separate thread! – Freek8 Jan 25 '12 at 22:53
  • @Freek8 I would give it a try and see how it goes. You may get fewer matches than with your regular edit distance, but I am not sure if you have much choice there. – Sergey Kalinichenko Jan 25 '12 at 23:26
  • thanks! With my old approach I get out-of-memory problems, or it takes ages. I'm going to try this. – Freek8 Jan 25 '12 at 23:30
  • @Freek8 You should accept ChrisJ's answer then, because he was the one to bring up Soundex. Good luck! – Sergey Kalinichenko Jan 25 '12 at 23:32

What about computing the soundex of each of your names and possibly storing it in the database? You can even do that on DB side, for instance there's a MySQL SOUNDEX function.

After computing the soundex of each name, all you have to do is group the rows by identical soundex.

EDIT:

If soundex is too coarse for your application, you can first select candidates by comparing their soundexes, and use your usual metric on each set of candidates.
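A sketch of that candidate-selection (blocking) step in Java, using a simplified American Soundex written here for illustration (Apache Commons Codec also ships a `Soundex` class, and MySQL's `SOUNDEX()` can do the same grouping on the DB side, as the answer notes). The class name and grouping helper are my own; this simplified coding may differ from MySQL's in edge cases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SoundexBlocking {

    // Simplified American Soundex: first letter plus up to three digit codes.
    // Vowels reset the previous code; H and W do not separate equal codes.
    static String soundex(String name) {
        String s = name.toUpperCase().replaceAll("[^A-Z]", "");
        if (s.isEmpty()) return "";
        final String codes = "01230120022455012623010202"; // codes for A..Z
        StringBuilder out = new StringBuilder().append(s.charAt(0));
        char prev = codes.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char c = codes.charAt(s.charAt(i) - 'A');
            if (c != '0' && c != prev) out.append(c);
            if (s.charAt(i) != 'H' && s.charAt(i) != 'W') prev = c;
        }
        while (out.length() < 4) out.append('0');
        return out.toString();
    }

    // Group names by Soundex code: the expensive pairwise metric then only
    // runs inside each (much smaller) group. The groups are independent,
    // so they can also be handed to separate threads or a parallel stream.
    static Map<String, List<String>> groupBySoundex(List<String> names) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String n : names)
            groups.computeIfAbsent(soundex(n), k -> new ArrayList<>()).add(n);
        return groups;
    }
}
```

Since the groups share no state, this also answers the multithreading question: iterating `groups.values()` with a parallel stream or a fixed thread pool parallelizes the per-group Jaro–Winkler comparisons with no locking.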

ChrisJ
  • I don't really fancy Soundex, but I like your idea of using it to select candidates, especially since there is a MySQL SOUNDEX function available. – Freek8 Jan 25 '12 at 22:43