I've been reviewing Eugene Myers' diff algorithm paper, "An O(ND) Difference Algorithm and Its Variations"; this is the algorithm implemented in the popular diff program.
On page 12, the paper presents the pseudo-code for the algorithm that finds a longest common subsequence of A and B:
LCS(A, N, B, M)
  If N > 0 and M > 0 Then
    Find the middle snake and the length D of an optimal path for A and B.
    Suppose it is from (x, y) to (u, v).
    If D > 1 Then
      LCS(A[1..x], x, B[1..y], y)
      Output A[x+1..u].
      LCS(A[u+1..N], N-u, B[v+1..M], M-v)
    Else If M > N Then
      Output A[1..N].
    Else
      Output B[1..M].
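
To check my reading of the control flow, I tried transcribing this into Python (0-indexed, unlike the paper's 1-based A[1..N]). The middle_snake helper below is only my own sketch of the paper's linear-space forward/reverse search, not a verified copy of it, so I may well have introduced errors there:

def middle_snake(A, B):
    # My sketch of the paper's forward/reverse search. Returns
    # (D, x, y, u, v): the length D of a shortest edit script and
    # the middle snake from (x, y) to (u, v), in 0-based coordinates.
    N, M = len(A), len(B)
    delta = N - M
    odd = (delta % 2) != 0
    Vf = {1: 0}              # furthest-reaching forward x, per diagonal k = x - y
    Vb = {delta + 1: N + 1}  # smallest reverse x, per diagonal k
    for d in range((N + M + 1) // 2 + 1):
        # Extend the furthest-reaching forward d-paths.
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and Vf[k - 1] < Vf[k + 1]):
                x = Vf[k + 1]          # step down from diagonal k + 1
            else:
                x = Vf[k - 1] + 1      # step right from diagonal k - 1
            y = x - k
            x0, y0 = x, y
            while x < N and y < M and A[x] == B[y]:
                x, y = x + 1, y + 1    # follow the snake
            Vf[k] = x
            if odd and delta - (d - 1) <= k <= delta + (d - 1) and Vb[k] <= x:
                return 2 * d - 1, x0, y0, x, y
        # Extend the furthest-reaching reverse d-paths.
        for k in range(delta - d, delta + d + 1, 2):
            if k == delta - d or (k != delta + d and Vb[k + 1] - 1 < Vb[k - 1]):
                x = Vb[k + 1] - 1      # step left from diagonal k + 1
            else:
                x = Vb[k - 1]          # step up from diagonal k - 1
            y = x - k
            u0, v0 = x, y
            while x > 0 and y > 0 and A[x - 1] == B[y - 1]:
                x, y = x - 1, y - 1    # follow the snake backwards
            Vb[k] = x
            if not odd and -d <= k <= d and x <= Vf[k]:
                return 2 * d, x, y, u0, v0
    raise AssertionError("a middle snake always exists")

def lcs(A, B, out):
    # The paper's LCS recursion, as I read it.
    N, M = len(A), len(B)
    if N > 0 and M > 0:
        D, x, y, u, v = middle_snake(A, B)
        if D > 1:
            lcs(A[:x], B[:y], out)
            out.append(A[x:u])   # the middle snake is part of the LCS
            lcs(A[u:], B[v:], out)
        elif M > N:
            out.append(A)        # D <= 1 and A is the shorter string
        else:
            out.append(B)

pieces = []
lcs("ABCABBA", "CBABAC", pieces)
print("".join(pieces))   # a longest common subsequence (length 4 here)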
Suppose A = "A" and B = "B". In this case, N = 1 and M = 1. The middle snake would be (x, y) = (0, 1) and (u, v) = (0, 1) because there are no diagonals. In this case D = 1 because the algorithm has only taken one step.
The algorithm says that the only thing to do in this scenario is Output B[1..M], i.e. "B", because N > 0, M > 0, D = 1, and M = N. But this seems wrong, because "A" and "B" have no non-empty common subsequence. The paper's commentary that "If D <= 1 then B is obtained from A by either deleting or inserting at most one symbol" seems incorrect too, because "A" must be deleted and "B" inserted.
What am I misinterpreting here?