
Here's the problem - I want to generate the delta of a binary file (> 1 MB in size) on a server and send the delta to a memory-constrained (low on RAM and no dynamic memory) embedded device over HTTP. Deltas are preferred (as opposed to sending the full binary file from the server) because of the high cost involved in transmitting data over the wire.

Trouble is, the embedded device cannot decode deltas and create the contents of the new file in memory. I have looked into various binary delta encoding/decoding algorithms like bsdiff, VCDiff etc. but was unable to find libraries that supported streaming.

Perhaps, rather than asking if there are suitable libraries out there, are there alternate approaches I can take that will still solve the original problem (send minimal data over the wire)? Although it would certainly help if there are suitable delta libraries out there that support streaming decode (written in C or C++ without using dynamic memory).

thegreendroid
  • Do you control the software on both the server and the embedded device? Does the embedded device have a copy of the file to begin with? Where does it keep it? (If the file is >1MB, it's unlikely to be keeping it in RAM!) What does the embedded system need to do with the file? – Dave M. Mar 06 '17 at 04:24
  • Yes I control the software on both ends. And yes it does have a copy of the original (reference) file. It will have to stream it onto a file on the file system because of memory limitations. The embedded device will need to create a new file by 'patching' the original (reference) file. – thegreendroid Mar 06 '17 at 08:57
  • Also, the application for these deltas is to reduce the cost of OTA upgrades for the embedded device. – thegreendroid Mar 06 '17 at 09:05
  • Are the target files fixed in size? Or can they also grow/shrink as their contents change? – Andrew Henle Mar 06 '17 at 11:56
  • How much locality do you expect in deltas? Do you want just to change some bytes or do you want to add and delete bytes? – vguberinic Mar 06 '17 at 12:54
  • Can you send a diff hunk by hunk (one hunk per diff) and apply them sequentially? Hopefully, no hunk is too large. – YSC Mar 06 '17 at 15:30
  • @AndrewHenle Target files will grow/shrink as their contents change. – thegreendroid Mar 06 '17 at 19:43
  • @vguberinic Generally I expect changes to be spread out and somewhat random, so bytes will be changed and also added/deleted. – thegreendroid Mar 06 '17 at 19:45
  • @YSC I had considered diffing the file in chunks (say 64 KB chunks) and then putting a proprietary protocol in place to check if a chunk is different (using hashes, similar to Remote Differential Compression). But this involves a lot of work on the server side which is not desirable. Your approach sounds plausible, I will experiment with the diff libraries and report back. Although I am unsure if the binary diff libraries like bsdiff support hunk by hunk output. – thegreendroid Mar 06 '17 at 20:04
  • Only deltas are in scope or compression in general? – Anty Mar 07 '17 at 01:42
  • @Anty I am open to any solution that is elegant and simple. – thegreendroid Mar 07 '17 at 01:43
  • 1
    My first attempt would be to use a simple diff algorithm which just solves the common subsequence problem (https://en.wikipedia.org/wiki/Longest_common_subsequence_problem). Generating a diff from this remains sequential, which is a good quality if you want to stream the diff. – CodeMonkey Mar 07 '17 at 13:27
  • 2
    If you don't have to access the file often (or can tolerate a modest amount of delay when you do), you could just store the diffs along with the original file, and implement on-the-fly patching in a file-reading shim on the embedded device. The shim would provide a standard `read( buf, len )` interface, but would fill in the buffer by going first to the original file, then through each diff that affects that part of the file, modifying the return buffer appropriately. For overwrite-type diffs, that's easy. However, for deletions and insertions, it would be complicated. – Dave M. Mar 07 '17 at 17:05
  • How much change do you expect? I mean, what % of the file will change with each delta? – Ajay Brahmakshatriya Mar 08 '17 at 05:05
  • @AjayBrahmakshatriya Like I mentioned previously, the file could be completely different or only a few bytes could have changed. It depends. – thegreendroid Mar 08 '17 at 22:07
  • Have you looked at [Xdelta](http://xdelta.org/xdelta3-api-guide.html)? – Roman Khimov Mar 09 '17 at 08:31
  • @RomanKhimov Yes I have, it looked promising at first but realised it's not suitable for our embedded system (uses dynamic memory and is also quite a big(ish) library). – thegreendroid Mar 09 '17 at 19:44
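Dave M.'s read-shim idea above can be sketched in C with no dynamic memory. The `patch_t` and `shim_read` names are hypothetical, only overwrite-type diffs are handled (insertions/deletions would need offset remapping, as he notes), and the original file is modeled as an in-memory array for brevity:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Overwrite-type diff record: replace `len` bytes at file offset `off`. */
typedef struct {
    size_t off;
    size_t len;
    const uint8_t *data;
} patch_t;

/* Hypothetical read shim: fill `buf` with up to `len` bytes starting at
 * `pos`, taking bytes from the original image first and then overlaying
 * any stored patches that intersect the requested range.
 * Returns the number of bytes actually produced. */
size_t shim_read(const uint8_t *orig, size_t orig_len,
                 const patch_t *patches, size_t n_patches,
                 size_t pos, uint8_t *buf, size_t len)
{
    if (pos >= orig_len) return 0;
    if (len > orig_len - pos) len = orig_len - pos;
    memcpy(buf, orig + pos, len);               /* base content first   */
    for (size_t i = 0; i < n_patches; i++) {    /* then overlay patches */
        const patch_t *p = &patches[i];
        size_t lo = p->off > pos ? p->off : pos;
        size_t hi_req = pos + len;
        size_t hi_p = p->off + p->len;
        size_t hi = hi_p < hi_req ? hi_p : hi_req;
        if (lo < hi)
            memcpy(buf + (lo - pos), p->data + (lo - p->off), hi - lo);
    }
    return len;
}
```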

3 Answers


Maintain a copy on the server of the current file as held by the embedded device. When you want to send an update, XOR the new version of the file with the old version and compress the resulting stream with any sensible compressor. (Algorithms that allow high-cost encoding in exchange for low-cost decoding would be particularly helpful here.) Send the compressed stream to the embedded device, which reads the stream, decompresses it on the fly, and XORs the result directly into (a copy of) the target file.

If your updates are such that the file content changes little over time and retains a fixed structure, the XOR stream will be predominantly zeroes, and will compress extremely well: number of bytes transmitted will be small, effort to decompress will be low, memory requirements on the embedded device will be minimal. The further your model is from these assumptions, the less this approach will gain you.
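The device-side decode loop can be sketched in C without dynamic memory. The zero-run-length encoding below is a made-up stand-in for a real decompressor (a production version would use a proper streaming inflater with static buffers); it is here only to show the structure of decompress-and-XOR in one pass:

```c
#include <stddef.h>
#include <stdint.h>

/* Apply a compressed XOR delta to `dst` in place.
 * Hypothetical encoding, for illustration only: a 0x00 byte is followed
 * by a run-length byte N meaning "N unchanged bytes"; any other byte is
 * a literal XOR value applied at the current position.
 * Returns the number of bytes of `dst` that were covered. */
size_t xor_patch(uint8_t *dst, size_t dst_len,
                 const uint8_t *delta, size_t delta_len)
{
    size_t di = 0, si = 0;
    while (si < delta_len && di < dst_len) {
        if (delta[si] == 0x00) {
            if (si + 1 >= delta_len) break;
            di += delta[si + 1];      /* skip a run of unchanged bytes */
            si += 2;
        } else {
            dst[di++] ^= delta[si++]; /* apply one literal XOR byte    */
        }
    }
    return di;
}
```

In a real build, `dst` would be a fixed block buffer cycled over the file on flash rather than the whole file, so RAM usage stays at one block regardless of file size.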

Tim

Since you said the delta could be arbitrarily random (from zero delta to a completely different file), compression of the delta may be a lost cause: lossless compression of random binary data is theoretically impossible. Also, since the embedded device has limited memory anyway, using a sophisticated, and therefore computationally expensive, library for compression/decompression of the occasional "simple" delta will probably be infeasible.

I would recommend simply sending the new file to the device in raw byte format, and overwriting the existing old file.

Kevin Li

As Kevin mentioned, compressing random data should not be your goal. A few more details about the type of data you're working with would be helpful. Context is key in compression.

You used the term image, which makes this sound like the classic video codec challenge. If you've ever seen weird video aliasing effects that corrupt only the changed portion of the frame, and then suddenly everything clears up, you've likely witnessed a key frame followed by a series of delta frames where the delta frames were not properly applied.

In this model, the server decides what's cheaper:

  • complete key frame
  • delta commands

The delta commands are communicated as a series of write instructions that can overlay the client's existing buffer.

Example Format:

  • [Address][Length][Repeat][Delta Payload]
  • [Address][Length][Repeat][Delta Payload]
  • [Address][Length][Repeat][Delta Payload]
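A device-side routine to apply one such command might look like the following. The `delta_cmd` struct and its field widths are assumptions for illustration; the answer does not pin down an exact wire layout:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One write command in the hypothetical [Address][Length][Repeat][Payload]
 * format: write `length` payload bytes at `address`, repeated `repeat`
 * times at consecutive offsets (useful for periodic structures). */
typedef struct {
    uint32_t address;       /* target offset in the client buffer   */
    uint16_t length;        /* payload size in bytes                */
    uint16_t repeat;        /* how many consecutive copies to write */
    const uint8_t *payload; /* `length` bytes of replacement data   */
} delta_cmd;

/* Overlay one command onto the client's existing buffer.
 * Returns 0 on success, -1 if the command would overrun the buffer. */
int apply_cmd(uint8_t *buf, size_t buf_len, const delta_cmd *cmd)
{
    size_t end = (size_t)cmd->address + (size_t)cmd->length * cmd->repeat;
    if (end > buf_len) return -1;
    for (uint16_t r = 0; r < cmd->repeat; r++)
        memcpy(buf + cmd->address + (size_t)r * cmd->length,
               cmd->payload, cmd->length);
    return 0;
}
```

The bounds check before any write matters on an embedded target: a truncated or corrupted command stream should fail cleanly rather than scribble past the buffer.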

There are likely a variety of methods for computing these delta commands. A brute force method would be:

  • Perform a Smith-Waterman alignment between the two images.
  • Compress the resulting transform into delta commands.
user3112728