Record all optimal sequence alignments when calculating Levenshtein distance in Julia

Question

I'm working on the Levenshtein distance with Wagner–Fischer algorithm in Julia.

It would be easy to get the optimal value, but a little hard to get the optimal operation sequence, like insert or deletion, while backtrace from the right down corner of the matrix.

I can record the pointer information of each d[i][j], but it might give me 3 directions to go back to d[i-1][j-1] for substitution, d[i-1][j] for deletion and d[i][j-1] for insertion. So I'm trying to get all combination of the operation sets that gave me the optimal Levenshtein distance.

It seems that I can store one operation set in one array, but I don't know the total number of all combinations as well as there length, so it would be hard for me to define an array to store the operation set during the enumeration process. How can I generate arrays while store the former ones? Or I should use Dataframe?

score 2 · Accepted Answer · answered Nov 17 '15 at 19:31

If you implement the Wagner-Fischer algorithm, at some point, you choose the minimum over three alternatives (see Wikipedia pseudo-code). At this point, you save the chosen alternative in another matrix. Using a statement like:

c[i,j] = indmin([d[i-1,j]+1,d[i,j-1]+1,d[i-1,j-1]+1])
# indmin returns the index of the minimum element in a collection.

Now c[i,j] contains 1,2 or 3 according to deletion, insertion or substitution. At the end of the calculation, you have the final d matrix element achieving the minimum distance, you then follow the c matrix backwards and read the action at each step. Keeping track of i and j allows reading the exact substitution by looking which element was in string1 at i and string2 at j in the current step. Keeping a matrix like c cannot be avoided because at the end of the algorithm, the information about the intermediate choices (done by min) would be lost.

score 0 · Answer 2 · answered Nov 17 '15 at 08:40

I'm not sure that I got your question but anyway, vectors in Julia are dynamic data structures, so you are always able to grow it using appropriate function, e.g pull!() , append!() , preapend!() also its possible to reshape() the result vector to an array of desired size.
but one particular approach for the above case could be obtained using sparse() matrix:

import Base.zero
Base.zero(ASCIIString)=""
module GSparse
  export insertion,deletion,substitude,result
  s=sparse(ASCIIString[])
  deletion(newval::ASCIIString)=begin
    global s
    s.n+=1
    push!(s.colptr,last(s.colptr))
    s[s.m,s.n]=newval
  end
  insertion(newval::ASCIIString)=begin
    global s
    s.m+=1
    s[s.m,s.n]=newval
  end
  substitude(newval::ASCIIString)=begin
    global s
    s.n+=1
    s.m+=1
    push!(s.colptr,last(s.colptr))
    s[s.m,s.n]=newval
  end
  result()=begin
    global s
    ret=zeros(ASCIIString,size(s))
    le=length(s)
    for (i = 1:le)
      ret[le-i+1]=s[i]
    end
    ret
  end
end
using GSparse
insertion("test");
insertion("testo");
insertion("testok");
substitude("1estok");
deletion("1stok");
result()

I like the approach because for large texts you could have many zero elements. also I fill data structure in forward way and create results by reversing.

Thank you too! Problem solved~I add a new vector to store the place where my computation has finished in the last iteration. — Catcoin7, Nov 23 '15 at 04:05

Record all optimal sequence alignments when calculating Levenshtein distance in Julia

2 Answers2