You can do this using pairwise alignment with affine gap costs, which takes O(nm) time for two strings of lengths n and m respectively.
One thing first: There is no way to find a provably minimal patch in terms of bits or bytes used. That's because if there was such a way, then the function shortest_patch(x, y)
that calculates it could be used to find a provably minimal compression of any given string s
by calling it with shortest_patch('', s)
, and Kolmogorov complexity tells us that the shortest possible compression of a given string is formally uncomputable. But if edits tend to be clustered in space, as it seems they are here, then it's certainly possible to find smaller patches than those produced using the usual Levenshtein distance algorithm.
Edit scripts
Patches are usually called "edit scripts" in CS. Finding a minimal (in terms of number of insertions plus number of deletions) edit script for turning one string x
into another string y
is equivalent to finding an optimal pairwise alignment in which every pair of equal characters has value 0, every pair of unequal characters has value -inf, and every position in which a character from one string is aligned with a -
gap character has value -1. Alignments are easy to visualise:
st--ing st-i-ng
stro-ng str-ong
These are 2 optimal alignments of the strings sting
and strong
, each having cost -3 under the model. If pairs of unequal characters are given the value -1 instead of -inf, then we get an alignment with cost equal to the Levenshtein distance (the number of insertions, plus the number of deletions, plus the number of substitutions):
st-ing sti-ng
strong strong
These are 2 optimal alignments under the new model, and each has cost -2.
To see how these correspond with edit scripts, we can regard the top string as the "original" string, and the bottom string as the "target" string. Columns containing pairs of unequal characters correspond to substitutions, the columns containing a -
in the top row correspond to insertions of characters, and the columns containing a -
in the bottom row correspond to deletions of characters. You can create an edit script from an alignment by using the "instructions" (C)opy, (D)elete, (I)nsert and (S)ubstitute. Each instruction is followed by a number indicating the number of columns to consume from the alignment, and in the case of I and S, a corresponding number of characters to insert or replace with. For example, the edit scripts for the previous 2 alignments are
C2, I1"r", S1"o", C2 and C2, S1"r", I1"o", C2
Increasing bunching
Now if we have strings like mississippi
and tip
, we find that the two alignments
mississippi
------tip--
mississippi
t---i----p-
both have the same score of -9: they both require the same total number of insertions, deletions and substitutions. But we much prefer the top one, because its edit script can be described much more succinctly: D6, S1"t", C2, D2
. The second's edit script would be S1"t", D3, C1, D4, C1, D1
.
In order to get the alignment algorithm to also "prefer" the first alignment, we can adjust gap costs so that starting a blocks of gaps costs more than continuing an existing block of gaps. If we make it so that a column containing a gap costs -2 instead of -1 when the preceding column contains no gap, then what we are effectively doing is penalising the number of contiguous blocks of gaps (since each contiguous block of gaps must obviously have a first position). Under this model, the first alignment above now costs -11, because it contains two contiguous blocks of gaps. The second alignment now costs -12, because it contains three contiguous blocks of gaps. IOW, the algorithm now prefers the first alignment.
This model, in which every aligned position containing a gap costs g and the first position in any contiguous block of gap columns costs g + s, is called the affine gap cost model, and an O(nm) algorithm was given for this by Gotoh in 1982: http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/archive/JMB82.pdf. Increasing the gap-open cost s will cause aligned segments to bunch together. You can play with the various cost parameters until you get alignments (corresponding to patches) that empirically look about right and are small enough.