I have ~25,000 distinct names in an SQL database, and I would like to run edit-distance comparisons across all of them in order to normalize variants, e.g. John Doe and Jhon Doe.
When the db held only around 1,000 names, I stored all distinct names in an array and used two nested for-loops over it, comparing each element with every other. When the edit-distance similarity score exceeded a threshold, say > 0.9, I executed an SQL query replacing one value with the other in all records.
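A minimal sketch of the approach described above, for reference. It assumes the similarity score is the Levenshtein distance normalized by the length of the longer string (1.0 means identical); the class and method names (`NameMatcher`, `similarity`, `findMatches`) are hypothetical, and the SQL update step is left out.

```java
import java.util.ArrayList;
import java.util.List;

public class NameMatcher {

    // Levenshtein distance using two rolling rows of the DP table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: similarity = 1 - distance / length of the longer string.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // Compares every unordered pair once (j starts at i + 1) and
    // collects the pairs whose similarity exceeds the threshold.
    static List<String[]> findMatches(List<String> names, double threshold) {
        List<String[]> matches = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                if (similarity(names.get(i), names.get(j)) > threshold) {
                    matches.add(new String[] { names.get(i), names.get(j) });
                }
            }
        }
        return matches;
    }
}
```

With n names this performs n(n-1)/2 comparisons, which is roughly 312 million pairs for 25,000 names, so the cost grows quadratically with the size of the table.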
With my much larger database this is no longer practical. What would you guys do?
PS: I'm also curious about any multithreaded solutions to this, because the process is taking ages now.
PPS: I'm coding in Java.