I have a file containing a number of names. I need to match each of those names against a second file and check whether the name appears there, possibly written in a different way (for example, "Verizon" -> "Verizon LTD").
I was doing this with the "Fuzzy Lookup" interop in Visual Studio 2008 and was getting good results.
Now I'm trying to use the Levenshtein distance to achieve the same result: the method compares the name I need against every entry in the other file (the full list) and returns the name with the best score, i.e. the highest probability of being the same.
The code I'm using is the following:
    public static int LevenshteinDistance(string src, string dest)
    {
        int[,] d = new int[src.Length + 1, dest.Length + 1];
        int i, j, cost;

        for (i = 0; i <= src.Length; i++)
        {
            d[i, 0] = i;
        }
        for (j = 0; j <= dest.Length; j++)
        {
            d[0, j] = j;
        }

        for (i = 1; i <= src.Length; i++)
        {
            for (j = 1; j <= dest.Length; j++)
            {
                if (src[i - 1] == dest[j - 1])
                    cost = 0;
                else
                    cost = 1;

                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
            }
        }

        return d[src.Length, dest.Length];
    }

    public static List<string> Search(string word, List<string> wordList, double fuzzyness)
    {
        List<string> foundWords = new List<string>();

        foreach (string s in wordList)
        {
            // Calculate the Levenshtein distance (case-insensitive):
            int levenshteinDistance =
                LevenshteinDistance(word.ToUpper(), s.ToUpper());

            // Length of the longer string:
            int length = Math.Max(word.Length, s.Length);

            // Calculate the score:
            double score = 1.0 - (double)levenshteinDistance / length;

            // Match?
            if (score >= fuzzyness)
            {
                foundWords.Add(s);
            }
        }

        return foundWords;
    }
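For reference, the behaviour I describe above (return the single name with the best score rather than every name over a threshold) could be sketched roughly as follows. This is only a minimal sketch: the class name `FuzzyMatchSketch` and the helpers `Score` and `BestMatch` are hypothetical names introduced for illustration, and the scoring formula is the same one used in `Search`.

```csharp
using System;
using System.Collections.Generic;

public static class FuzzyMatchSketch
{
    // Same dynamic-programming recurrence as LevenshteinDistance above.
    public static int LevenshteinDistance(string src, string dest)
    {
        int[,] d = new int[src.Length + 1, dest.Length + 1];
        for (int i = 0; i <= src.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= dest.Length; j++) d[0, j] = j;

        for (int i = 1; i <= src.Length; i++)
        {
            for (int j = 1; j <= dest.Length; j++)
            {
                int cost = (src[i - 1] == dest[j - 1]) ? 0 : 1;
                d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                                   d[i - 1, j - 1] + cost);
            }
        }
        return d[src.Length, dest.Length];
    }

    // Hypothetical helper: case-insensitive score in [0, 1],
    // computed as 1 - distance / length of the longer string.
    public static double Score(string a, string b)
    {
        string ua = a.ToUpperInvariant();
        string ub = b.ToUpperInvariant();
        int length = Math.Max(ua.Length, ub.Length);
        if (length == 0) return 1.0; // two empty strings are treated as identical
        return 1.0 - (double)LevenshteinDistance(ua, ub) / length;
    }

    // Hypothetical helper: returns the candidate with the highest score,
    // or null when the candidate list is empty. Ties keep the first candidate.
    public static string BestMatch(string word, IEnumerable<string> candidates, out double bestScore)
    {
        string best = null;
        bestScore = -1.0;
        foreach (string s in candidates)
        {
            double score = Score(word, s);
            if (score > bestScore)
            {
                bestScore = score;
                best = s;
            }
        }
        return best;
    }
}
```

Called as `FuzzyMatchSketch.BestMatch("Verizon", new List<string> { "Verizon LTD", "Vodafone" }, out score)`, this picks "Verizon LTD". Note that dividing by the longer length means a candidate that only adds or drops a whole word is penalised by every extra character.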
The following example is a test I ran, in which the word we wanted to match was "ILCA INC". We ran it with the similarity threshold set to >= 0.77.
Words from the search list:
"ILCA", score approx. 0.5 --> This is the result we got with the VS2008 "Fuzzy Lookup".
"ICE INC", score approx. 0.77 --> This is the one returned by my code.
I would be really grateful for any input on this subject. I'm having trouble getting this app to arrive at the same result as the "Fuzzy Lookup" does.
Let me know if there is any more information I can provide, or if I have expressed my question poorly.