3

Gusfield (Algorithms on Strings, Trees, and Sequences, Section 11.8.6) describes a dynamic programming algorithm for finding the best alignment between two sequences A and B under the assumption that the penalty assigned to a gap of length x in one of the aligned sequences is of the form R+Sx for constants R and S. In the special case where S=0, there is a penalty for beginning a gap but no penalty for lengthening it. It seems to me that there is an error in Gusfield's formulae and I hope that someone can help me understand the true state of affairs.

Gusfield defines four functions V(i,j), G(i,j), E(i,j) and F(i,j) with the following properties:

  • V(i,j) is the best score possible for alignments of the prefixes A[1..i] and B[1..j].
  • E(i,j) is the best possible score for alignments of these prefixes under the condition that B[j] is paired against a gap in A.
  • F(i,j) is the best possible score for alignments of these prefixes under the condition that A[i] is paired against a gap in B.
  • G(i,j) is the best possible score for alignments that pair A[i] with B[j].

The recurrences for i and j greater than or equal to 1 are:

V(i,j)=max[E(i,j),F(i,j),G(i,j)]
G(i,j)=V(i,j)+w(A[i],B[j]) where w is the score for a match/mismatch.
E(i,j)=max(E(i,j-1),V(i,j-1)-R]
F(i,j)=max(F(i-1,j),V(i-1,j)-R]

Finally the boundary conditions are given as:

V(i,0)=E(i,0)=-R
V(0,j)=F(0,j)=-R

With all that as preliminaries, consider the situation where we have two sequences each of length one, so that A=p and B=q. There are only three possible alignments in this situation:

p    p-    -p
q    -q    q-

These have scores respectively w(p,q), -2R, -2R. In particular we should have E(0,1)=F(1,0)=-2R. However, the recurrences give that E(0,1) and F(1,0) are both bigger than or equal to -R.

This error in the boundary conditions has consequences. Suppose for example that A=pppppp...p and B=qqqq...q with p and q different. The alignment that sets A completely off from B:

A-
-B

should score as -2R (it has two gaps) and so this alignment is eventually optimal assuming w(p,q)<0.

Gusfield's algorithm does not seem to handle this case correctly.

In practical situations this probably doesn't matter, because typical strings have a lot of matches and so the boundary cases don't contribute to the solution.

Comments/criticisms welcome.

1 Answers1

1

Yes, two of the equations are actually incorrect. They should be

F(i,j)=max(F(i,j-1),V(i,j-1)-R] E(i,j)=max(E(i-1,j),V(i-1,j)-R]

Since E(i,j) is the best possible score for alignments of these prefixes under the condition that A[i] is paired against a gap in B, the alignment is made of the optimal alignment of A[1:i-l] against B[1:j] and an l-long gap (the indels are against A[i-l+1:i]).

  • I agree. However, as I understand it, the alignment you've shown should be scored as -2R, and therefore E(1,1) should be -2R. But the recurrence gives E(1,1)=-R. I think that the 1st column and row, as well as the zeroth column and row, need to be set as part of the initialization of the recurrence. – Jeremy Teitelbaum Jun 23 '14 at 14:51
  • Now I have understood your point. I have edited by answer, showing the error in the formulae. – Gianluca Della Vedova Jun 25 '14 at 07:45
  • Anyway, Gusfield's book gives a different definition of E() --- the gap is in A. In that case you should swap the indices in the boundary conditions of E(i,0), F(0,i) – Gianluca Della Vedova Jun 25 '14 at 07:58
  • I had stated Gusfield's algorithm slightly incorrectly as you point out; the text defining E and F were swapped but the recurrences were correct. I edited the question to fix that problem, so now my text is fully consistent with Gusfield. – Jeremy Teitelbaum Jun 25 '14 at 13:20
  • With your new definition of E() and F(), the recurrences in my answer are now correct. – Gianluca Della Vedova Jun 26 '14 at 04:15
  • Let's put it in another way. Restate the boundary conditions as: `V(i,0)=E(i,0)=E(i-1,0)` if i>1, V(0,j)=F(0,j)=F(0,j-1)` if j>1, `V(1,0)=E(1,0)=V(0,1)=E(0,1)=-R`, `V(0,0)=0`. Notice that all E() equations for E (boundary and actual recurrences) are expressed as a function of E(i-1,.), while before the boundary and actual recurrence were not consistent with each other. – Gianluca Della Vedova Jun 26 '14 at 04:21