
I have a file with an "X" number of names. I need to match each of those names against another file and see if the name is amongst them, possibly written in a different way ("Verizon" -> "Verizon LTD").

I was doing this with the "Fuzzy Lookup" interop in Visual Studio 2008 and was getting good results.

Now I'm trying to implement the Levenshtein distance method to achieve the same result: the method compares the name I need against every entry in the other file's full list, and returns the name with the best score/probability of being the same.

The code I'm using is the following:

public static int LevenshteinDistance(string src, string dest)
{
    int[,] d = new int[src.Length + 1, dest.Length + 1];
    int i, j, cost;

    for (i = 0; i <= src.Length; i++)
    {
        d[i, 0] = i;
    }
    for (j = 0; j <= dest.Length; j++)
    {
        d[0, j] = j;
    }


    for (i = 1; i <= src.Length; i++)
    {
        for (j = 1; j <= dest.Length; j++)
        {

            if (src[i - 1] == dest[j - 1])
                cost = 0;
            else
                cost = 1;

            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);

        }
    }

    return d[src.Length, dest.Length];
}

public static List<string> Search(string word, List<string> wordList, double fuzzyness)
{
    List<string> foundWords = new List<string>();

    foreach (string s in wordList)
    {
        // Calculate the Levenshtein-distance:
        int levenshteinDistance =
            LevenshteinDistance(word.ToUpper(), s.ToUpper());

        // Length of the longer string:
        int length = Math.Max(word.Length, s.Length);

        // Calculate the score:
        double score = 1.0 - (double)levenshteinDistance / length;

        // Match?
        if (score >= fuzzyness)
        {
            foundWords.Add(s);
        }
    }
    return foundWords;
}
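As a standalone sanity check of the scoring above, the two candidates from my test can be run through the same formula (the `Distance` helper here is just a condensed copy of the `LevenshteinDistance` method above, so this compiles on its own):

```csharp
using System;

class ScoreCheck
{
    // Condensed copy of the LevenshteinDistance method above,
    // so this check is self-contained.
    public static int Distance(string src, string dest)
    {
        int[,] d = new int[src.Length + 1, dest.Length + 1];
        for (int i = 0; i <= src.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= dest.Length; j++) d[0, j] = j;
        for (int i = 1; i <= src.Length; i++)
            for (int j = 1; j <= dest.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (src[i - 1] == dest[j - 1] ? 0 : 1));
        return d[src.Length, dest.Length];
    }

    static void Main()
    {
        // Same scoring formula as the Search method above.
        foreach (string s in new[] { "ILCA", "ICE INC" })
        {
            int dist = Distance("ILCA INC", s.ToUpper());
            double score = 1.0 - (double)dist / Math.Max("ILCA INC".Length, s.Length);
            Console.WriteLine("{0}: {1:0.00}", s, score);
        }
        // ILCA: 0.50   ICE INC: 0.75
    }
}
```

So with this formula "ILCA" scores 0.50 (distance 4, the dropped " INC") and "ICE INC" scores 0.75 (distance 2).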

The following example is a test I ran in which the word we wanted to match was "ILCA INC". We ran it as follows:

Similarity threshold: >= 0.77

  • "ILCA" (score ≈ 0.5) --> This is the result we got with the VS2008 "Fuzzy Lookup".
  • "ICE INC" (score ≈ 0.77) --> This is the one returned by my code.

I would be really grateful for any input on this subject; I'm having trouble getting this app to arrive at the same result as the "Fuzzy Lookup" does.

Let me know if there is any more information I can provide, or if I have expressed my question poorly.

juharr
Patrick
  • http://www.catalysoft.com/articles/StrikeAMatch.html – Eser Aug 07 '15 at 13:25
  • It might help you if you used meaningful names for your variables. And some comments wouldn't hurt, either. –  Aug 07 '15 at 13:33
  • A few questions: Are you sure your Levenshtein distance algorithm is correct? Have you tested it against a bunch of words and ensured that it's returning the correct edit distance? Are you sure that the Fuzzy Lookup tool you're using is backed by the same algorithm? Are you certain that the score returned by the fuzzy lookup tool is `1 - EditDistance / WordLength`? – w.brian Aug 07 '15 at 13:36
  • @Amy That's just the standard [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) algorithm. The names in the `Search` method seem pretty good to me. – juharr Aug 07 '15 at 13:40
  • @w.brian Hey! Thanks for your quick answer. I'm sure the algorithm is correct; I tested it and it returns the correct edit distance. We don't know what the Fuzzy Lookup is based on; we used the Levenshtein distance in an attempt to mimic its result, and I'm still trying to find out how it works. Since we don't know the algorithm behind it, we can only compare its results against our app's, and that's where the discrepancy is. – Patrick Aug 07 '15 at 14:03

2 Answers


Based on the results, Microsoft's fuzzy search score is not as simple as 1 - EditDistance / WordLength. The edit distance between "ILCA INC" and "ICE INC" is 2 (one deletion and one substitution). That is fewer edits than the 4 needed to reach "ILCA", the better result returned by Microsoft's Fuzzy Lookup.

While Fuzzy Lookup may be using edit distance as part of its equation, I'd assume the overall method for determining a matching score is proprietary, and both algorithmic and heuristic in nature. As you can probably tell, Fuzzy Lookup is prioritizing a word with a substring match starting at position 0 over words with a lower edit distance.
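One way to approximate that behavior is to blend the edit-distance score with a bonus for a shared prefix. To be clear, this is only a guess at the kind of signal Fuzzy Lookup might weigh, not its actual algorithm, and the 50/50 weighting here is arbitrary:

```csharp
using System;

class BlendedScoreSketch
{
    // Standard Levenshtein distance (same algorithm as in the question).
    public static int Distance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
        return d[a.Length, b.Length];
    }

    // Hypothetical blended score: edit-distance similarity plus a prefix bonus.
    // NOT Fuzzy Lookup's actual algorithm; the 50/50 weighting is arbitrary.
    public static double BlendedScore(string a, string b)
    {
        a = a.ToUpper();
        b = b.ToUpper();
        double editScore = 1.0 - (double)Distance(a, b) / Math.Max(a.Length, b.Length);

        // Count leading characters the two strings share.
        int prefix = 0;
        while (prefix < Math.Min(a.Length, b.Length) && a[prefix] == b[prefix])
            prefix++;
        double prefixScore = (double)prefix / Math.Min(a.Length, b.Length);

        return 0.5 * editScore + 0.5 * prefixScore;
    }
}
```

With this blend, "ILCA" scores 0.75 against "ILCA INC" (edit score 0.50, full prefix match) while "ICE INC" drops to roughly 0.45, reversing the ranking toward Fuzzy Lookup's choice.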

w.brian

It can be very handy to write a debugging routine that dumps the contents of the d array so you can verify that it works. For example, for the comparison you mention:

    I C E   I N C
  0 1 2 3 4 5 6 7
I 1 0 1 2 3 4 5 6
L 2 1 1 2 3 4 5 6
C 3 2 1 2 3 4 5 5
A 4 3 2 2 3 4 5 6
  5 4 3 3 2 3 4 5
I 6 5 4 4 3 2 3 4
N 7 6 5 5 4 3 2 3
C 8 7 6 6 5 4 3 2

As another poster mentioned, the distance of 2 is correct: there are 2 edits in your comparison (a dropped L, and an E in place of the A). I get .75 for the score; I'm not sure how you got .77.

I'd be willing to bet that the Microsoft algorithm is calculating the score differently. It may be taking the minimum or average of the two lengths rather than the maximum, as you do.
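For the "ILCA INC" / "ILCA" pair, the choice of denominator swings the score considerably. This is plain arithmetic, not a claim about what Fuzzy Lookup actually computes:

```csharp
using System;

class DenominatorDemo
{
    static void Main()
    {
        // "ILCA INC" vs "ILCA": edit distance 4, lengths 8 and 4.
        int dist = 4, lenA = 8, lenB = 4;
        double byMax = 1.0 - (double)dist / Math.Max(lenA, lenB); // 1 - 4/8 = 0.50
        double byMin = 1.0 - (double)dist / Math.Min(lenA, lenB); // 1 - 4/4 = 0.00
        double byAvg = 1.0 - dist / ((lenA + lenB) / 2.0);        // 1 - 4/6 ≈ 0.33
        Console.WriteLine("{0:0.00} {1:0.00} {2:0.00}", byMax, byMin, byAvg);
    }
}
```

The spread between 0.00 and 0.50 for the same pair shows how sensitive the percentage is on short strings, which is what the next paragraphs are about.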

The calculation of a 'percent correct' with algorithms such as Levenshtein is a difficult issue. As you can see from your example, comparisons of short strings yield wild swings in percentages, and a threshold that works well for longer comparisons doesn't work well for shorter ones (and vice versa).

Threshold Ramp: Your current decision-making logic uses a constant value, regardless of the input string lengths. However, it is sometimes more practical to use a 'ramp', where the over/under value varies depending on the string lengths. For example, you might decide that strings under three characters must match completely (100%), four-character strings must match over 70%, five-character strings at 75%, six at 80%, etc. At some point (after about 8-10 characters), you can usually stick with a single value.

The implementation is fairly simple, using a double[] lookup table (with the thresholds on the same 0-1 scale as the score, and the index clamped to the table length):

double[] thresholds = new double[] { 1.00, 1.00, 1.00, 0.70, 0.75, 0.80, /* etc */ };
int maxLen = Math.Max(src.Length, dest.Length);
double targetThreshold = thresholds[Math.Min(maxLen, thresholds.Length) - 1];

...

if (score >= targetThreshold)
  foundWords.Add(s);

(using threshold values that suit your needs). It usually delivers a more practical result.

The downside to this technique is that it is difficult to code if you want a truly variable threshold percentage. As you see in my example, I'm ignoring the fuzzyness input parameter.

Marc Bernier
  • Hi @Marc, thank you for your answer, sorry for the delay. You are right about the score values; I indeed used '.75', it was a typo. Regarding the last bit of your answer, I'm looking into the 'threshold ramp' code and I don't quite get how it works, because I've never used that logic before. Could you please expand on it? Thanks in advance. – Patrick Aug 10 '15 at 19:27