Given two bit strings, x and y, with x longer than y, I'd like to compute a kind of asymmetric variant of the Levenshtein distance between them. Starting with x, I'd like to know the minimum number of deletions and substitutions it takes to turn x into y.

Can I just use the usual Levenshtein distance for this, or do I need to modify the algorithm somehow? In other words, with the usual set of edits (deletion, substitution, and addition), is it ever beneficial to delete more than the difference in lengths between the two strings and then add some bits back? I suspect the answer is no, but I'm not sure. If I'm wrong, and I do need to modify the definition of Levenshtein distance to disallow additions, how do I do so?

Finally, I would expect intuitively that I'd get the same distance if I started with y (the shorter string) and only allowed additions and substitutions. Is this right? I have a sense of what the answers are; I just can't prove them.

dextrous
  • I think the regular old Levenshtein distance does what you describe. It can handle strings of unequal length. – hatchet - done with SOverflow Nov 29 '14 at 23:39
  • Yes, I know it can, that's not what I'm asking. I'm asking if the Levenshtein distance between two strings of unequal length might sometimes be *different* from the distance obtained by only applying deletions and substitutions to the longer of the two strings. Hence my second paragraph. – dextrous Nov 29 '14 at 23:59

1 Answer

If I understand you correctly, I think the answer is yes: the Levenshtein edit distance can differ from the result of an algorithm that only allows deletions and substitutions on the longer string. Because of this, you would need to modify the algorithm, or write a different one, to get your limited version.

Consider the two strings "ABCD" and "ACDEF". The Levenshtein distance is 3 (ABCD->ACD->ACDE->ACDEF). If we start with the longer string and limit ourselves to deletions and substitutions, we must use 4 edits (1 deletion and 3 substitutions). The reason is that the cheapest path from the shorter string to the longer one begins by deleting a character from the shorter string. Reversing that path would require an insertion into the longer string, which is the complementary operation you're disallowing.
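
To make the two numbers easy to check, here is a small sketch in Python (not the C# used further down). It compares a textbook Levenshtein implementation with a brute-force version of the restricted distance. The brute force relies on one observation: with only deletions and substitutions, exactly `len(longer) - len(shorter)` characters must be deleted, and every remaining mismatch costs one substitution. The function names are my own, chosen for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def del_sub_bruteforce(longer: str, shorter: str) -> int:
    """Try every way of deleting the surplus characters, then substitute."""
    surplus = len(longer) - len(shorter)
    if surplus < 0:
        raise ValueError("first argument must be at least as long as the second")
    best = None
    # Each combination is the set of positions of `longer` that are kept.
    for kept in combinations(range(len(longer)), len(shorter)):
        mismatches = sum(longer[i] != c for i, c in zip(kept, shorter))
        cost = surplus + mismatches
        if best is None or cost < best:
            best = cost
    return best


print(levenshtein("ACDEF", "ABCD"))         # 3
print(del_sub_bruteforce("ACDEF", "ABCD"))  # 4
```

The brute force is exponential in the worst case, so it is only useful as a cross-check on short strings.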

Your last paragraph is true. If the path from shorter to longer uses only insertions and substitutions, then any allowed path can simply be reversed from the longer to the shorter. Substitutions are the same regardless of direction, but the inserts when going from small to large become deletions when reversed.
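
The reversal argument can also be checked numerically. The sketch below, in Python, writes both restricted recurrences directly and marks unreachable prefix pairs (a source prefix that would have to shrink or grow with no operation available to do so) with infinity. Both function names are made up for this illustration, and this is a different formulation from the C# modification further down.

```python
INF = float("inf")


def ins_sub_distance(shorter: str, longer: str) -> float:
    """Fewest insertions and substitutions turning `shorter` into `longer`."""
    m, n = len(shorter), len(longer)
    # d[i][j]: cost of turning shorter[:i] into longer[:j]; INF if impossible.
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        d[0][j] = j  # build longer[:j] from the empty string by j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (shorter[i - 1] != longer[j - 1])
            ins = d[i][j - 1] + 1
            d[i][j] = min(sub, ins)
    return d[m][n]


def del_sub_distance(longer: str, shorter: str) -> float:
    """Fewest deletions and substitutions turning `longer` into `shorter`."""
    m, n = len(longer), len(shorter)
    # d[i][j]: cost of turning longer[:i] into shorter[:j]; INF if impossible.
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # reduce longer[:i] to the empty string by i deletions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (longer[i - 1] != shorter[j - 1])
            dele = d[i - 1][j] + 1
            d[i][j] = min(sub, dele)
    return d[m][n]


print(ins_sub_distance("ABCD", "ACDEF"))  # 4
print(del_sub_distance("ACDEF", "ABCD"))  # 4
```

If the arguments are passed in the wrong order (the source needs an operation that is not allowed), both functions return infinity rather than a misleading number.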

I haven't tested this thoroughly, but this modification shows the direction I would take, and it appears to work with the values I've tried. It's written in C# and follows the pseudocode in the Wikipedia entry for Levenshtein distance. There are obvious optimizations that could be made, but I refrained from making them so that the changes from the standard algorithm are easier to see. An important observation is that (using your constraints) if the strings are the same length, then substitution is the only operation allowed.

    static int LevenshteinDistance(string s, string t) {
        int i, j;
        int m = s.Length;
        int n = t.Length;

        // for all i and j, d[i,j] will hold the restricted edit distance between
        // the first i characters of s and the first j characters of t;
        // note that d has (m+1)*(n+1) values
        var d = new int[m + 1, n + 1];

        // no need to zero the array: C# initializes new int arrays to zero

        // source prefixes can be transformed into empty string by
        // dropping all characters
        for (i = 0; i <= m; i++) d[i, 0] = i;

        // target prefixes can be reached from empty source prefix
        // by inserting every character
        for (j = 0; j <= n; j++) d[0, j] = j;

        for (j = 1; j <= n; j++) {
            for (i = 1; i <= m; i++) {
                if (s[i - 1] == t[j - 1])
                    d[i, j] = d[i - 1, j - 1];       // no operation required
                else {
                    int del = d[i - 1, j] + 1;   // a deletion
                    int ins = d[i, j - 1] + 1;   // an insertion
                    int sub = d[i - 1, j - 1] + 1; // a substitution
                    // the next two lines are the modification I've made
                    //int insDel = (i < j) ? ins : del;
                    //d[i, j] = (i == j) ? sub : Math.Min(insDel, sub);
                    // the following 8 lines are a clearer version of the above 2 lines 
                    if (i == j) {
                        d[i, j] = sub;
                    } else {
                        int insDel;
                        if (i < j) insDel = ins; else insDel = del;
                        // assign the smaller of insDel or sub
                        d[i, j] = Math.Min(insDel, sub);
                    }
                }
            }
        }
        return d[m, n];
    }
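
For readers who don't read C#, here is roughly the same function sketched in Python. It is a line-for-line translation of the code above, not an independently derived algorithm, and the name `restricted_distance` is my own. The only change from the textbook recurrence is the three-way branch on `i` versus `j`.

```python
def restricted_distance(s: str, t: str) -> int:
    """Edit distance where only the longer prefix may lose characters."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # drop all i characters of the source prefix
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters of the target prefix

    for j in range(1, n + 1):
        for i in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                d[i][j] = d[i - 1][j - 1]  # no operation required
                continue
            sub = d[i - 1][j - 1] + 1  # a substitution
            if i == j:
                # equal-length prefixes: substitution is the only option
                d[i][j] = sub
            elif i < j:
                # source prefix is shorter: insertion allowed, deletion not
                d[i][j] = min(d[i][j - 1] + 1, sub)
            else:
                # source prefix is longer: deletion allowed, insertion not
                d[i][j] = min(d[i - 1][j] + 1, sub)
    return d[m][n]


print(restricted_distance("ACDEF", "ABCD"))  # 4
print(restricted_distance("1010", "0110"))   # 2
```

Cells with `i >= j` only ever read other cells with `i >= j`, so when `s` is the longer string no insertion can leak into the final answer.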
  • Thank you, I think you've understood exactly what I'm going for. Can you suggest how to adapt the usual dynamic programming algorithm for Levenshtein distance to one that does what I'm looking for? The trouble I'm running into is that the standard algorithm involves computing the distance between all substrings. However, some of the substrings of y will be *longer* than certain substrings of x, and so will be unreachable with only deletions and substitutions. – dextrous Nov 30 '14 at 06:09
  • Thank you! I'm not able to upvote answers yet, but I've accepted. – dextrous Nov 30 '14 at 18:06
  • I don't mean to be greedy, but do you think you could pseudocode the two lines at the end that are modifications to the original? I don't know C#, but I can follow everything else. – dextrous Nov 30 '14 at 18:09