I have been trying to implement a Levenshtein distance function in C++ that gives different weights to substitutions and insertions, depending on which characters are being replaced or inserted.
The cost is based on the distance between the keys on a QWERTY keyboard. For example, with the standard edit distance algorithm, the distances between google, hoogle, and zoogle are all the same: 1. What I want is different distances for these, something like google -> hoogle = 1, google -> zoogle = 4, hoogle -> zoogle = 5.
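To make the cost concrete, here is a minimal sketch of how a key-distance helper such as getKeyboardDist (the helper my function calls) could be written. It is an illustration under assumptions, not necessarily the exact helper I use: it treats the three letter rows as an unstaggered grid, takes the larger of the row and column offsets as the distance, and falls back to a cost of 1 for characters that are not letters. The findKey helper is a hypothetical name introduced only for this sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: find a letter on an idealized, unstaggered QWERTY grid.
// Returns false if the character is not one of the 26 letter keys.
static bool findKey(char c, int &row, int &col) {
    static const std::string rows[] = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    for (int r = 0; r < 3; ++r) {
        std::string::size_type pos = rows[r].find(c);
        if (pos != std::string::npos) {
            row = r;
            col = static_cast<int>(pos);
            return true;
        }
    }
    return false;
}

// Distance between two keys: the larger of the row and column offsets.
// Assumption: any pair involving a non-letter key costs 1 (0 if identical).
int getKeyboardDist(char a, char b) {
    int ra, ca, rb, cb;
    if (!findKey(a, ra, ca) || !findKey(b, rb, cb))
        return (a == b) ? 0 : 1;
    return std::max(std::abs(ra - rb), std::abs(ca - cb));
}
```

With this grid, g and h are 1 apart, g and z are 4 apart, and h and z are 5 apart, which matches the example distances above.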
I followed the Wikipedia algorithm, which uses a matrix for memoization, and implemented it in C++. Here is my function:
    int levDist(string s, string t) {
        int i, j, m, n, temp, subsitutionCost, deletionCost, insertionCost, keyDist;
        deletionCost = 1;
        m = s.length();
        n = t.length();
        int d[m+1][n+1];
        for (i = 0; i <= m; i++)
            d[i][0] = i;
        for (j = 0; j <= n; j++)
            d[0][j] = j;
        for (j = 1; j <= n; j++)
        {
            for (i = 1; i <= m; i++)
            {
                // getKeyboardDist(char a, char b) gives the distance between the two keys
                keyDist = getKeyboardDist(s[i-1], t[j-1]);
                subsitutionCost = (s[i-1] == t[j-1]) ? 0 : keyDist;
                // this line is the one I think the problem lies in
                insertionCost = (i > j) ? getKeyboardDist(s[i-1], t[j-2]) : getKeyboardDist(s[i-2], t[j-1]);
                insertionCost = insertionCost ? insertionCost : 1;
                d[i][j] = min((d[i-1][j] + deletionCost),
                              min((d[i][j-1] + insertionCost),
                                  (d[i-1][j-1] + subsitutionCost)));
            }
        }
        return d[m][n];
    }
I believe the substitutions work properly; the problem is the insertions. I don't know which pair of characters to measure the distance between for an insertion, especially when the insertion is at the beginning or end of the string.
I would appreciate any help with this. Let me know if any other information is needed.
Thanks in advance.