3

I'm trying to convey the difference between two bytestreams. I want to minimize the number of bytes in the patch.

(I don't necessarily want to minimize the number of "changes" in the diff, which is what the optimal patch in a Levenshtein distance computation would give me.)

The patch would ideally be in a format such that, given the source bytestream and the diff, it would be easy to reconstruct the target bytestream.

Is there a good algorithm for doing this?


Edit: For the record, I've tried sending changes of the form "at spot 506, insert the following bytes...", where I create the change list from the Levenshtein distance algorithm.

The problem I have is that the Levenshtein distance algorithm gives me a lot of changes like:

at spot 506 substitute [some bytes1]
at spot 507 do nothing
at spot 508 substitute [some bytes2]
at spot 509 do nothing
at spot 510 substitute [some bytes3]
...

This is because the Levenshtein distance algorithm tries to minimize the number of changes. However, for my purposes this instruction set is wasteful. It would probably be better if the algorithm just said,

At spot 506 substitute [some bytes1, [byte at spot 507], some bytes2, [byte at spot 509], some bytes3, ...]

There's probably some way to modify the Levenshtein distance algorithm to favor these types of changes, but it seems a little tricky. I could coalesce substitutions after getting a change list (and I'm going to try that; see the sketch below), but there may be opportunities to coalesce deletions/insertions too, and it's less obvious how to do that correctly.
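Here's roughly what I mean by coalescing, as a Python sketch. The change-list format (a sorted list of single-byte substitutions) and the max_gap threshold are just placeholders for whatever my real change list looks like:

    # Sketch: coalesce nearby single-byte substitutions into one larger
    # substitution, re-sending the untouched bytes in between.
    # subs: sorted list of (pos, new_byte) against `source`.
    # max_gap: how many unchanged bytes we're willing to re-send to merge.
    def coalesce_substitutions(source, subs, max_gap=2):
        merged = []  # list of (start_pos, replacement_bytes)
        for pos, new_byte in subs:
            if merged and pos - (merged[-1][0] + len(merged[-1][1])) <= max_gap:
                start, data = merged[-1]
                gap = source[start + len(data):pos]   # untouched bytes to re-send
                merged[-1] = (start, data + gap + bytes([new_byte]))
            else:
                merged.append((pos, bytes([new_byte])))
        return merged

    src = bytes(range(200)) * 3                      # dummy 600-byte source
    subs = [(506, 0xAA), (508, 0xBB), (510, 0xCC)]
    print(coalesce_substitutions(src, subs))
    # one instruction covering spots 506-510, re-sending the bytes at 507 and 509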

Just wondering if there's a special-purpose algorithm for this (or if somebody has already modified the Levenshtein distance algorithm to favor these types of changes).

user1435114

4 Answers

2

You can do this using pairwise alignment with affine gap costs, which takes O(nm) time for two strings of lengths n and m respectively.

One thing first: there is no way to find a provably minimal patch in terms of bits or bytes used. If there were, the function shortest_patch(x, y) that computes it could be used to find a provably minimal compression of any given string s by calling shortest_patch('', s), and Kolmogorov complexity tells us that the shortest possible compression of a given string is formally uncomputable. But if edits tend to be clustered in space, as they seem to be here, then it's certainly possible to find smaller patches than those produced by the usual Levenshtein distance algorithm.

Edit scripts

Patches are usually called "edit scripts" in CS. Finding a minimal (in terms of number of insertions plus number of deletions) edit script for turning one string x into another string y is equivalent to finding an optimal pairwise alignment in which every pair of equal characters has value 0, every pair of unequal characters has value -inf, and every position in which a character from one string is aligned with a - gap character has value -1. Alignments are easy to visualise:

st--ing    st-i-ng
stro-ng    str-ong

These are 2 optimal alignments of the strings sting and strong, each having cost -3 under the model. If pairs of unequal characters are given the value -1 instead of -inf, then we get an alignment with cost equal to the Levenshtein distance (the number of insertions, plus the number of deletions, plus the number of substitutions):

st-ing    sti-ng
strong    strong

These are 2 optimal alignments under the new model, and each has cost -2.

To see how these correspond with edit scripts, we can regard the top string as the "original" string, and the bottom string as the "target" string. Columns containing pairs of unequal characters correspond to substitutions, the columns containing a - in the top row correspond to insertions of characters, and the columns containing a - in the bottom row correspond to deletions of characters. You can create an edit script from an alignment by using the "instructions" (C)opy, (D)elete, (I)nsert and (S)ubstitute. Each instruction is followed by a number indicating the number of columns to consume from the alignment, and in the case of I and S, a corresponding number of characters to insert or replace with. For example, the edit scripts for the previous 2 alignments are

C2, I1"r", S1"o", C2     and     C2, S1"r", I1"o", C2

Increasing bunching

Now if we have strings like mississippi and tip, we find that the two alignments

mississippi
------tip--

mississippi
t---i----p-

both have the same score of -9: they both require the same total number of insertions, deletions and substitutions. But we much prefer the top one, because its edit script can be described much more succinctly: D6, S1"t", C2, D2. The second's edit script would be S1"t", D3, C1, D4, C1, D1.

In order to get the alignment algorithm to also "prefer" the first alignment, we can adjust gap costs so that starting a block of gaps costs more than continuing an existing block of gaps. If we make it so that a column containing a gap costs -2 instead of -1 when the preceding column contains no gap, then what we are effectively doing is penalising the number of contiguous blocks of gaps (since each contiguous block of gaps must obviously have a first position). Under this model, the first alignment above now costs -11, because it contains two contiguous blocks of gaps; the second alignment now costs -12, because it contains three contiguous blocks of gaps. In other words, the algorithm now prefers the first alignment.

This model, in which every aligned position containing a gap costs g and the first position in any contiguous block of gap columns costs g + s, is called the affine gap cost model, and an O(nm) algorithm was given for this by Gotoh in 1982: http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/archive/JMB82.pdf. Increasing the gap-open cost s will cause aligned segments to bunch together. You can play with the various cost parameters until you get alignments (corresponding to patches) that empirically look about right and are small enough.
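Here is a score-only Python sketch of Gotoh's recurrences, with costs chosen to match the example above (match 0, mismatch -1, opening a gap block -2, extending it -1). Recovering the actual alignment, and hence the patch, additionally requires keeping traceback pointers, which I've omitted:

    # Score-only sketch of Gotoh's O(nm) affine-gap alignment.
    NEG_INF = float('-inf')

    def affine_align_score(x, y, mismatch=-1, gap_open=-2, gap_extend=-1):
        n, m = len(x), len(y)
        M  = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i-1] aligned to y[j-1]
        Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i-1] aligned to a gap
        Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y[j-1] aligned to a gap
        M[0][0] = 0
        for i in range(1, n + 1):
            Ix[i][0] = gap_open + (i - 1) * gap_extend
        for j in range(1, m + 1):
            Iy[0][j] = gap_open + (j - 1) * gap_extend
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = 0 if x[i - 1] == y[j - 1] else mismatch
                M[i][j]  = s + max(M[i-1][j-1], Ix[i-1][j-1], Iy[i-1][j-1])
                Ix[i][j] = max(M[i-1][j] + gap_open, Ix[i-1][j] + gap_extend)
                Iy[i][j] = max(M[i][j-1] + gap_open, Iy[i][j-1] + gap_extend)
        return max(M[n][m], Ix[n][m], Iy[n][m])

    print(affine_align_score('mississippi', 'tip'))   # -11, the first alignment above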

j_random_hacker
1

There are two approaches to solving this kind of problem:

1) Establish a language for X (edit scripts, in this case), and figure out how to minimize the length of the applicable sentence; or,

2) Compute some kind of minimum representation for Y (string differences), and then think up a way to represent that in the shortest form.

The Myers diff paper ("An O(ND) Difference Algorithm and Its Variations") demonstrates that for a particular language, finding the minimum set of changes and finding the minimum length of the change representation are the same problem.

Obviously, changing the language might invalidate that assumption, and certain changes might be extremely complicated to apply correctly. For example, suppose the language included the primitive kP, meaning "remove the next k characters whose indices are prime". For certain diffs, using that primitive might turn out to be a huge win, but the applications are probably pretty rare. It's an absurd example, I know, but it demonstrates the difficulty of starting with a language.

So I propose starting with the minimum change list, which identifies inserts and deletes. We translate that in a straightforward way to a string of commands, of which there are exactly three. There are no indices here. The idea is that we start with a cursor at the beginning of the original string, and then execute the commands in sequence. The commands are:

=  Advance the cursor without altering the character it points to
Ic Insert the character `c` before the cursor.
D  Delete the character at the cursor.

Although I said there were exactly three commands, that's not quite true; there are actually A+2 where A is the size of the alphabet.

This might result in a string like this:

=========================IbIaInIaInIaDD=D=D============================
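Here's a rough Python sketch of that translation step. I'm using difflib's SequenceMatcher as a stand-in for a true minimal (Myers-style) diff, and treating a "replace" as a run of deletes followed by a run of inserts:

    # Sketch: turn a diff into the =/Ic/D command string described above.
    import difflib

    def command_string(old, new):
        out = []
        sm = difflib.SequenceMatcher(None, old, new)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == 'equal':
                out.append('=' * (i2 - i1))
            elif tag == 'delete':
                out.append('D' * (i2 - i1))
            elif tag == 'insert':
                out.append(''.join('I' + c for c in new[j1:j2]))
            else:  # 'replace': delete the old run, then insert the new run
                out.append('D' * (i2 - i1))
                out.append(''.join('I' + c for c in new[j1:j2]))
        return ''.join(out)

    print(command_string('abcdef', 'abXdf'))   # ==DIX=D=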

Now, let's try to compress this. First, we run-length encode (RLE), so that every command is preceded by a repeat count, and we drop the trailing =s

27=1Ib1Ia1In1Ia1In1Ia2D1=1D1=1D

(In effect, the RLE recreates indices, although they're relative instead of absolute).
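A sketch of that RLE step (the exact textual encoding is only for illustration):

    # Sketch: run-length encode the command string, with the count before
    # each command, dropping any trailing run of '='.
    import re

    def rle(commands):
        runs = re.findall(r'=+|D+|I.', commands)
        while runs and set(runs[-1]) == {'='}:
            runs.pop()                     # drop the trailing '=' run
        out = []
        for run in runs:
            if run[0] == 'I':
                out.append('1' + run)      # each Ic is its own length-1 run
            else:
                out.append(f'{len(run)}{run[0]}')
        return ''.join(out)

    print(rle('=====IbIaDD=D==='))   # 5=1Ib1Ia2D1=1D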

Finally, we use zlib to compress the resulting string. I'm not going to do that here, but just to give some idea of the sort of compression it might come up with:

27=1Ib1Ia1In||2D1=1D|
      ______+|  ____+
      ___<---+

(Trying to show the back-references. It's not very good ascii art, sorry.)

Lempel-Ziv is very good at finding and optimizing unexpected repetitions. In fact, we could have just used it instead of doing the intermediate RLE step, but experience shows that in cases where RLE is highly effective, it's better to LZ the RLE than the source. But it would be worth trying both ways to see what's better for your application.
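Trying both ways is cheap to do with zlib; the strings below are just made up to mirror the example above:

    # Sketch: compare zlib-compressing the RLE'd command string against
    # zlib-compressing the raw command string.
    import zlib

    raw = '=' * 27 + 'IbIaInIaInIa' + 'DD=D=D' + '=' * 28
    rle_form = '27=1Ib1Ia1In1Ia1In1Ia2D1=1D1=1D'

    print(len(zlib.compress(rle_form.encode(), 9)),
          len(zlib.compress(raw.encode(), 9)))
    # whichever is smaller is the patch you send; apply by decompressing
    # and replaying the commands against the original bytes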

rici
0

A common approach to this that uses very few bytes (though not necessarily the theoretical optimal number of bytes) is the following:

  1. Pad the shorter stream with some character (perhaps zero) until the two streams have the same length.
  2. XOR the two streams together. This will result in a byte stream that is zero everywhere the bytes are the same and nonzero otherwise.
  3. Compress the XORed stream using any compression algorithm, perhaps something like LZW.

Assuming that the difference is a localized set of changes to a small part of the file, this will result in a very short patch, since the bulk of the XORed stream will be zeros, which compress very efficiently.

To apply the patch, you just decompress the XORed stream and then XOR it with the byte stream being patched. This computes

Original XOR (Original XOR New) = (Original XOR Original) XOR New = New

since XOR is associative and self-inverse.
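A minimal sketch of the whole round trip in Python, using zlib as the "any compression algorithm"; note that a real patch would also need to record the target length when the two streams differ in size:

    import zlib

    def make_patch(original: bytes, new: bytes) -> bytes:
        n = max(len(original), len(new))
        a = original.ljust(n, b'\0')                  # step 1: pad to equal length
        b = new.ljust(n, b'\0')
        xored = bytes(x ^ y for x, y in zip(a, b))    # step 2: XOR the streams
        return zlib.compress(xored, 9)                # step 3: compress

    def apply_patch(original: bytes, patch: bytes) -> bytes:
        xored = zlib.decompress(patch)
        padded = original.ljust(len(xored), b'\0')
        return bytes(x ^ y for x, y in zip(padded, xored))

    old = bytes(1000)                              # 1000 zero bytes
    new = old[:506] + b'\x2a' + old[507:]          # one-byte change at offset 506
    patch = make_patch(old, new)
    assert apply_patch(old, patch) == new
    print(len(patch))                              # small, but not a handful of bytes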

Hope this helps!

templatetypedef
  • Hmm. I don't know how good the compression algorithms are, but my bytestreams are on average 1k bytes. Many pairs involve just a tiny change (in the range of 1 to 5 bytes). The xor'd string would be 1k in length ... Assuming 99% compression I'm still using 10 bytes to convey a 1 byte change. Seems like we could do better than that with an instruction of the form, "at spot 506 insert 0101010." (Maybe this would take 5 bytes) I need to convey a lot of patches so this difference actually matters. (Also, 99% compression may be optimistic.) – user1435114 Dec 08 '12 at 19:28
  • I updated my original answer to better reflect what I've done so far. – user1435114 Dec 08 '12 at 19:35
0

There is a promising new approach to change detection. The sequence alignment problem is treated as an abstract model of change detection in collaborative text editing, designed to minimize the probability of merge conflicts. A new cost function is defined as the probability of intersection between the detected changes and a random string. The result should be closer to patch-length minimization than other known approaches, and it avoids the known shortcomings of LCS and other approaches. A cubic-time algorithm has been proposed: http://psta.psiras.ru/read/psta2015_1_3-10.pdf