3

I have been trying to implement a levenshtein distance function in C++ that gives different weights to substitutions and insertions based on which characters are being replaced or inserted.

The cost is calculated based on the distance of the keys on an qwerty keyboard. For example, in the standard edit distance algorithm, the distance between google, hoogle, and zoogle is the same; 1. What I want is different distances for these. Something like google -> hoogle = 1, google -> zoogle = 4, hoogle -> zoogle = 5.

I followed the Wikipedia algorithm using the matrix for memoization and implemented it in c++. Here is my function.

int levDist(string s, string t) {

    int i,j,m,n,temp,subsitutionCost, deletionCost, insertionCost, keyDist;
    deletionCost = 1;

    m = s.length();
    n = t.length();
    int d[m+1][n+1];

    for(i=0;i<=m;i++)
        d[i][0] = i;
    for(j=0;j<=n;j++)
        d[0][j] = j;

    for (j=1;j<=n;j++)
    {
        for(i=1;i<=m;i++)
        {
            // getKeyboardDist(char a, char b) gives distance b/w the two keys
            keyDist = getKeyboardDist(s[i-1],t[j-1]); 

            subsitutionCost = (s[i-1] == t[j-1]) ? 0 : keyDist;

            // this line is the one i think the problem lies in
            insertionCost = (i > j) ? getKeyboardDist(s[i-1],t[j-2]) : getKeyboardDist(s[i-2],t[j-1]);


            insertionCost = insertionCost ? insertionCost : 1;

            d[i][j] = min((d[i-1][j]   + deletionCost),
                      min((d[i][j-1]   + insertionCost),
                          (d[i-1][j-1] + subsitutionCost)));`
        }
    }
    return d[m][n];
}

Now the subsitutions work properly I beleive, but the problem is the insertions. I dont know how to find which characters to get the distance between for insertions. Especially the cases when the insertion is in the beginning or end of the string.

I would appreciate any help in this, let me know if there is any other information needed.

Thanks in advance.

KaziJehangir
  • 295
  • 1
  • 3
  • 9

2 Answers2

3

What you're trying to do makes sense for substitutions. You're hypothesizing that a person attempting to strike a key X is more likely to make a mistake by striking a key physically near X than one that's far away.

It doesn't make so much sense for insertions and deletions because the act of striking an extra key (insertion error) or skipping a key strike (deletion error) doesn't have anything obvious to do with key distances.

It's possible you're being misdirected by the two different meanings of "distance" in play here. Levenshtein distance is measured between strings in insert/substitute/delete operations. Keyboard distance is a physical separation. These are apples and oranges that happen to be described with the same word. They don't mix very well.

You are attempting to determine weights for the Levenshtein operations. Physical distances between keys make reasonable weights for substitution.

Weights for insert and delete - which involve only one character each - don't have any obvious relationship to physical separation.

What you'd really want is frequency data regarding which keys people actually insert and delete by mistake. You'd give most common relatively low weights and the least common high ones.

@user6952491's idea that repeating the previous key might be a high frequency insert error has merit, but it's hard to extend this to a complete weighting scheme.

If you're in a guessing mood, you could hypothesize that it's easier to mistakenly insert a key near the middle of the keyboard than at the edges. Say f and j get lowest weights and characters like ~ that are shifted and at the keyboard extremes get high weights because it's unlikely you'd make the physical motions to type them without thinking.

I'll leave it to you to formulate a similar guess about deletions.

For general typing, my guess is that keyboarding errors will have at least as much to do with spelling misapprehensions as physical mistakes. I.e., people will type "recieve" because they forgot the rule "i before e except after c," not due to the keyboard position of i relative to e.

Other kinds of typing, e.g. computer code, might well have entirely different patterns for making mistakes. Forgotten semicolons come to mind! Those would have very low weights!

Consequently, I'm virtually certain that the suggestions provided by modern spell checkers are rooted in machine learning algorithms that draw conclusions from mistakes that many thousands of people have made in the past on similar tasks, not simple metrics based on keyboard distance.

Gene
  • 46,253
  • 4
  • 58
  • 96
  • Thank you for the long and insightful reply. I agree that it makes more sense for substitutions than for insertions and deletions. I apologize if my question wasn't clear about it, but yes what I wanted to do was to have separate weights (or costs) for insertions and substitutions based on which keys the involve. The edit distance will be between two strings of course. Your ideas about the frequency of making typos is good. It is similar to what I am trying to do ultimately; finding what features make a string more likely to be typo-ed, based on real world typing data. – KaziJehangir Oct 13 '16 at 12:14
0

if you put all the keys in a graph, you could easily calculate their distance. it would even be easier if you build it as simple neighbor list or a matrix with values you insert yourself.

in my opinion insertion should be counted as the minimum distance between the right and the left (if exist) letters + 1, because google and gooogle are pretty similar, but google and gowogle is very far. therefore google->googl: =7.

yd1
  • 277
  • 4
  • 13
  • Hi thanks for the reply! I have saved the key coordinates in a hash map and can calculate the distances between any two keys. Your idea about the left and right one is what I thought of as well. Just couldn't wrap my head around how to put that in code in the context of the substring matrix. – KaziJehangir Oct 13 '16 at 10:36
  • i think you can mainly copy the code from wikipedia: https://en.wikipedia.org/wiki/Levenshtein_distance and change increase of decrease of 1 to a function called `int calcDistance(key1, key2)` – yd1 Oct 13 '16 at 14:01