
I understand that the Levenshtein distance algorithm computes the minimum number of deletions, insertions, and substitutions required to change string A into string B. However, I was wondering how to separately keep track of the number of deletions among the total edits required to make the change. I was looking at this implementation of the algorithm:

def levenshtein(first, second)
    first = first.split
    second = second.split
    first_size = first.size
    second_size = second.size
    matrix = [(0..first_size).to_a]
    (1..second_size ).each do |j|
        matrix << [j] + [0] * (first_size)
    end
    count = 0
    (1..second_size).each do |i|
       (1..first_size).each do |j|
         if first[j-1] == second[i-1]
           matrix[i][j] = matrix[i-1][j-1]
         else
           matrix[i][j] = [matrix[i-1][j],matrix[i][j-1], matrix[i-1][j-1]].min + 1
         end
       end
    end
    return matrix.last.last 
end

So in order to keep track of deletions, I tried:

if matrix[i-1][j] == [matrix[i-1][j], matrix[i][j-1], matrix[i-1][j-1]].min

and incrementing count when it holds, but this doesn't seem to work. I also tried taking the difference in size between the two strings, but that fails for the following case:

String 1: "my response to prompt#1"
String 2: "my edited response to"

There is clearly 1 deletion here, but simply taking the difference in size won't detect it.
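To make the failure concrete, here is a minimal sketch in Ruby. The helper name `word_count_difference` is made up for illustration and is not part of the code above. Both strings contain four words, so the size difference is 0 even though one word disappeared and another was added.

```ruby
# Hypothetical helper: difference in word counts between two strings.
def word_count_difference(first, second)
  first.split.size - second.split.size
end

a = "my response to prompt#1"
b = "my edited response to"

# Both strings have 4 words, so the difference is 0.
puts word_count_difference(a, b)

# Array difference is only a rough illustration (it ignores order and
# duplicates), but it shows what changed:
p a.split - b.split  # words only in a: ["prompt#1"]
p b.split - a.split  # words only in b: ["edited"]
```

The insertion of `edited` cancels out the deletion of `prompt#1` in the word count, which is why the counting has to happen inside the dynamic-programming table.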

Does anyone know how to keep track of the number of deletions involved in the total edits for changing string A into string B?

aronchick
kchoi
  • Could you please clarify what counts as a deletion for you? – Sébastien Oct 10 '14 at 08:30
  • Deleting a word. For example, `"my response to prompt#1", "my response to"`, we have a deletion of `"prompt#1"` – kchoi Oct 10 '14 at 08:38
  • Could you please post a working console example? It would help with fixing it :) – Sébastien Oct 10 '14 at 10:17
  • It's not much of a fix as much as figuring out how to count the number of deletions in the dp process. – kchoi Oct 10 '14 at 15:54
  • Ideally, it should pass for the following cases: Case 1: `"my response to prompt#1", "my edited response to prompt#1"` should return 1 because you added `edited.` Case 2: `"haha bob", "haha bobber"` should return 1 because you substituted bob with bobber. Case 3: `"my response to prompt#1", "my response to"` should return -1 because there was 1 deletion. So this is what I want to do, return negative number when it is deleting. – kchoi Oct 10 '14 at 15:55

1 Answer


We can make the deletion count ride along with the total edit count by making each entry of the table a list consisting of the two quantities. (As a side effect, the secondary optimization goal is to minimize the number of deletions among all minimum-length edit sequences. I don't know whether this is desirable or not.)

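def levenshtein(first, second)
    first = first.split
    second = second.split
    first_size = first.size
    second_size = second.size
    # Each cell is a pair [total edits, deletions].
    # Row 0: turning the first j words of first into nothing takes j deletions.
    # These must be pairs too: indexing a plain Integer with [] returns a bit.
    matrix = [(0..first_size).map { |j| [j, j] }]
    # Column 0: building the first i words of second from nothing takes i insertions.
    (1..second_size).each do |i|
        matrix << [[i, 0]] + [[0, 0]] * first_size
    end
    (1..second_size).each do |i|
       (1..first_size).each do |j|
         if first[j-1] == second[i-1]
           matrix[i][j] = matrix[i-1][j-1]
         else
           matrix[i][j] = [[matrix[i-1][j  ][0]+1, matrix[i-1][j  ][1]  ],      # insertion
                           [matrix[i  ][j-1][0]+1, matrix[i  ][j-1][1]+1],      # deletion
                           [matrix[i-1][j-1][0]+1, matrix[i-1][j-1][1]  ]].min  # substitution
         end
       end
    end
    return matrix.last.last  # [total edits, deletions]
end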
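
The `.min` call over pairs in the code above relies on how Ruby orders arrays: `Array#<=>` compares element by element, so the first entry decides and the second entry only breaks ties. A small sketch of that behavior, using made-up candidate values:

```ruby
# Each candidate is a [total_edits, deletions] pair (illustrative values).
candidates = [[2, 1], [2, 0], [3, 0]]

# Fewest total edits wins; among ties, fewest deletions wins.
p candidates.min        # => [2, 0]

# The first element dominates: [1, 5] sorts before [2, 0].
p([1, 5] <=> [2, 0])    # => -1
```

This is why the deletion count acts as a secondary optimization goal: it never changes which total distance is chosen, only which of the equally short edit sequences is reported.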
David Eisenstat
  • This seems to fail for `String 1: "the the the" String 2: "the the"` Know why that might be? It should return 1 for matrix.last.last[1] but returns 0 instead. – kchoi Oct 10 '14 at 20:24
  • I added `if second_size < first_size && matrix.last.last[1] == 0 @difference = (first_size - second_size) end` and it seemed to solve the problem. – kchoi Oct 10 '14 at 20:41