3

I am working with Levenshtein (edit) distance computed by dynamic programming. I think I understand how the Wagner-Fischer algorithm does this efficiently. However, the algorithm doesn't look constructive. If I compute that the edit distance between two strings is, e.g., 10, then I would also like to determine a particular sequence of 10 edits that turns one into the other. Can this be done efficiently too? If so, how?

dextrous

2 Answers

7

While trying to implement Ante's algorithm I got wrong results, which means either it is wrong or I implemented it incorrectly. In any case I got it working, and here is my more detailed algorithm. See the Wagner-Fischer algorithm for a description of the matrix d.

  1. Start at cell d(m, n)
  2. Check cells d(m - 1, n - 1), d(m - 1, n) and d(m, n - 1) and determine which one contains the smallest value.
    • If it's d(m - 1, n - 1) (prefer this if it's a tie) then you have either
      • a substitution if d(m - 1, n - 1) < d(m, n). Decrement m and n by one.
      • no operation if d(m - 1, n - 1) == d(m, n). Decrement m and n by one.
    • If it's d(m - 1, n) then you have a deletion. Decrement m by one.
    • If it's d(m, n - 1) then you have an insertion. Decrement n by one.

If a cell lookup would use a negative index, skip that cell. When you arrive at cell (0, 0) you're done. You will have produced the list of edits in reverse order.

I wrote an implementation in Python that outputs the exact instructions, including the characters and offsets involved in each operation. It also includes some tests that validate the output and demonstrate its format.
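For readers who want something to experiment with, here is a minimal sketch of the steps above. It is not the implementation mentioned in this answer. The names `levenshtein_matrix`, `edit_script` and `apply_script`, and the `(operation, position, character)` tuple format, are assumptions made for illustration. Positions are indexes into the original source string.

```python
def levenshtein_matrix(a, b):
    """Wagner-Fischer: d[i][j] is the edit distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i characters
    for j in range(n + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d


def edit_script(a, b):
    """Return one minimal list of (operation, position, character) tuples."""
    d = levenshtein_matrix(a, b)
    i, j = len(a), len(b)
    ops = []
    while i > 0 or j > 0:
        # Collect the neighbours that exist; the middle tuple element
        # makes the diagonal win ties, as described in the answer.
        candidates = []
        if i > 0 and j > 0:
            candidates.append((d[i - 1][j - 1], 0, "diag"))
        if i > 0:
            candidates.append((d[i - 1][j], 1, "up"))
        if j > 0:
            candidates.append((d[i][j - 1], 2, "left"))
        _, _, move = min(candidates)
        if move == "diag":
            if d[i - 1][j - 1] < d[i][j]:
                ops.append(("substitute", i - 1, b[j - 1]))
            # equal values: characters match, nothing to record
            i -= 1
            j -= 1
        elif move == "up":
            ops.append(("delete", i - 1, a[i - 1]))
            i -= 1
        else:
            ops.append(("insert", i, b[j - 1]))
            j -= 1
    ops.reverse()  # the walk produced the edits back to front
    return ops


def apply_script(a, ops):
    """Apply the edits right to left so the recorded positions stay valid."""
    chars = list(a)
    for op, pos, ch in reversed(ops):
        if op == "substitute":
            chars[pos] = ch
        elif op == "delete":
            del chars[pos]
        else:
            chars.insert(pos, ch)
    return "".join(chars)


if __name__ == "__main__":
    script = edit_script("kitten", "sitting")
    print(script)
    print(apply_script("kitten", script))
```

The backward walk visits at most m + n cells, so recovering one edit sequence costs far less than building the matrix itself.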

jlh
  • This is the same algorithm as described by Ante, but with more detail in how to decide the exact type of edit. – Reuben Morais Feb 19 '20 at 13:55
  • It's possible that I just messed up while implementing it and that Ante's answer is also correct. It's too long ago to really remember what the problem was. In any case I updated the answer with even more details and wrote an implementation in Python. – jlh Feb 19 '20 at 16:22
2

It is very constructive. With the resulting matrix it is possible to find all the different sequences of edits that produce the minimal distance.

To find the edits, traverse the resulting matrix backward, starting from the result cell (m, n).

  • If the value of cell (m-1, n-1) is the same and the characters at these positions are equal, no edit is needed.
  • Otherwise, find the cell(s) from {(m-1, n-1), (m-1, n), (m, n-1)} with the smallest value. The position of the cell with the smallest value determines whether a substitution, deletion or insertion is performed. If several cells share the smallest value, then more than one sequence of edits produces the minimal distance. If you need only one sequence, choose any one of them.

Repeat the same check until the path reaches cell (0, 0).

The path of checks gives the edits performed, in reverse order.
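As a rough illustration of finding every minimal sequence, the sketch below follows each predecessor cell that is consistent with the Wagner-Fischer recurrence instead of picking just one. The function name `all_edit_scripts` and the `(operation, position, character)` tuples are assumptions made for this example, with positions referring to the original first string. The number of minimal sequences can grow exponentially, so this is only practical for short strings.

```python
def all_edit_scripts(a, b):
    """Return every minimal edit sequence turning a into b (short inputs only)."""
    m, n = len(a), len(b)

    # Build the Wagner-Fischer matrix: d[i][j] = distance(a[:i], b[:j]).
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,
                          d[i - 1][j] + 1,
                          d[i][j - 1] + 1)

    def walk(i, j):
        """Yield each optimal script for a[:i] -> b[:j], in forward order."""
        if i == 0 and j == 0:
            yield []
            return
        if i > 0 and j > 0:
            same = a[i - 1] == b[j - 1]
            if same and d[i - 1][j - 1] == d[i][j]:
                # Characters match: move diagonally without an edit.
                yield from walk(i - 1, j - 1)
            elif not same and d[i - 1][j - 1] + 1 == d[i][j]:
                for script in walk(i - 1, j - 1):
                    yield script + [("substitute", i - 1, b[j - 1])]
        if i > 0 and d[i - 1][j] + 1 == d[i][j]:
            for script in walk(i - 1, j):
                yield script + [("delete", i - 1, a[i - 1])]
        if j > 0 and d[i][j - 1] + 1 == d[i][j]:
            for script in walk(i, j - 1):
                yield script + [("insert", i, b[j - 1])]

    return list(walk(m, n))


if __name__ == "__main__":
    for script in all_edit_scripts("ab", "ba"):
        print(script)
```

Each branch is taken only when the neighbouring cell could actually have produced the current value, so every yielded sequence has exactly d(m, n) edits.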

Ante