
I am using the Levenshtein distance algorithm to compare a company name provided as user input against a database of known company names to find the closest match. By itself the algorithm works okay, but I want to build in a bias so that the edit distance is considered lower if the initial parts of the strings match.

For example, if the search term is "ABCD", then both "ABCD Co." and "XYX ABCD" have the same edit distance of 4 (four insertions each). However, I want to give weight to the fact that the initial part of the first string matches the search term more closely than the initial part of the second string does.

One way of doing this might be to make the insert/delete/replace costs higher at the beginning of the strings and lower towards the end. Does anyone have an example of a successful implementation of this? Is Levenshtein distance still the best way to do what I am trying to achieve? Is my assumed approach sound?
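To make the idea concrete, here is a minimal sketch of a position-weighted Levenshtein distance. The function name `weighted_levenshtein`, the geometric `decay` weighting, and the `floor` parameter are all illustrative assumptions, not an established algorithm or a tuned solution.

```python
def weighted_levenshtein(query, candidate, decay=0.9, floor=0.1):
    """Edit distance in which edits near the start of the strings cost more.

    `decay` and `floor` are hypothetical tuning knobs: an edit at
    0-based position p costs max(floor, decay ** p).
    With decay=1.0 this reduces to the plain Levenshtein distance.
    """
    def cost(pos):
        return max(floor, decay ** pos)

    m, n = len(query), len(candidate)
    # dp[i][j] = cheapest way to turn query[:i] into candidate[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + cost(i - 1)      # delete everything
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + cost(j - 1)      # insert everything

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = query[i - 1] == candidate[j - 1]
            substitute = 0.0 if same else cost(min(i - 1, j - 1))
            dp[i][j] = min(
                dp[i - 1][j] + cost(i - 1),        # delete query[i-1]
                dp[i][j - 1] + cost(j - 1),        # insert candidate[j-1]
                dp[i - 1][j - 1] + substitute,     # match or replace
            )
    return dp[m][n]


near = weighted_levenshtein("ABCD", "ABCD Co.")
far = weighted_levenshtein("ABCD", "XYX ABCD")
print(near < far)  # the prefix match should now score as closer
```

With these assumed weights, the four insertions needed for "XYX ABCD" fall at the expensive start of the string, while those for "ABCD Co." fall at the cheaper end, so the two candidates no longer tie.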

UPDATE: For my immediate purposes I have decided to forgo the above and instead use the Jaro-Winkler distance, which seems to solve the problem. However, I will leave this open for further input.
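For anyone who wants to try the same route, below is a minimal pure-Python sketch of Jaro-Winkler similarity. It is not the code I actually used; the function names are my own, and `prefix_scale=0.1` with a prefix cap of 4 characters are the conventional defaults, assumed here rather than tuned.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(jaro_winkler("ABCD", "ABCD Co.") > jaro_winkler("ABCD", "XYX ABCD"))
```

Note that this is a similarity, not a distance: higher means closer, so candidates would be ranked in descending order of score.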

tshepang
user1368587
    I'm looking for the same thing. Did you have any luck with your solution? Maybe you could provide a code sample? – Leonardo Mar 27 '13 at 22:52

1 Answer


What you're looking for looks like a Smith-Waterman local alignment: http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
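As a rough illustration, here is a minimal sketch of the Smith-Waterman scoring recurrence with a linear gap penalty. The function name and the `match`, `mismatch`, and `gap` values are arbitrary assumptions for the example, not parameters recommended for company names.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty).

    The scoring values are illustrative assumptions. Only the score is
    returned; no traceback of the aligned region is performed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere,
            # which is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ABCD", "ABCD Co."))
print(smith_waterman_score("ABCD", "XYX ABCD"))
```

One caveat: because a local alignment ignores where the matching region sits, this sketch scores "ABCD Co." and "XYX ABCD" the same against "ABCD". A preference for matches at the start of the string would still have to be added on top, for example as a separate prefix bonus.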

Pierre
    Hi Pierre. This algorithm indeed looks interesting. However I am not sure whether something used for gene-sequence matching will apply as well to matching strings containing company names. Ultimately the result needs to be translated into a normalized match-percentage which indicates likeness of the two strings, while weighing more if the initial sequences match. – user1368587 May 04 '12 at 22:32