I'm developing a backup tool and I can't figure out the most efficient way to do remote backups. I don't want to send the whole file every time there's a small change, so I guess incremental backup is the solution. This is all well and good, but now I'm stuck on the problem of how to split one file into multiple chunks.
Here's the problem. Let's say I have a simple text file where each line is one chunk:
First line
Second line
Third line
Fourth line
Now I have 4 chunks. If I update the second line to, say, "THE second line", I only need to back up the second chunk.
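To make that concrete, here is a minimal sketch of what I mean (Python and SHA-256 are just arbitrary choices for illustration), comparing the chunks position by position:

```python
import hashlib

old = ["First line", "Second line", "Third line", "Fourth line"]
new = ["First line", "THE second line", "Third line", "Fourth line"]

# Compare chunk i of the old file with chunk i of the new file by hash.
for i, (a, b) in enumerate(zip(old, new)):
    a_hash = hashlib.sha256(a.encode()).hexdigest()
    b_hash = hashlib.sha256(b.encode()).hexdigest()
    print(f"chunk {i}: {'changed' if a_hash != b_hash else 'same'}")
# Only chunk 1 (the second line) differs, so only that chunk needs to be sent.
```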
But what if something like this happens:
First line
First and half line
Second line
Third line
Fourth line
Now that I added "First and half line", every line after it is in a different place. So if each line is one chunk, it looks like every chunk after the first has changed, even though the content is the same. A sketch of this failure is below.
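Here is the same position-by-position comparison applied to this case (again a minimal Python sketch, with SHA-256 chosen arbitrarily):

```python
import hashlib

def line_chunks(text: str) -> list[bytes]:
    # Treat each line of the file as one chunk.
    return [line.encode() for line in text.splitlines()]

old = "First line\nSecond line\nThird line\nFourth line\n"
new = "First line\nFirst and half line\nSecond line\nThird line\nFourth line\n"

old_hashes = [hashlib.sha256(c).hexdigest() for c in line_chunks(old)]
new_hashes = [hashlib.sha256(c).hexdigest() for c in line_chunks(new)]

# Compare chunk i of the new file against chunk i of the old file.
for i, h in enumerate(new_hashes):
    same = i < len(old_hashes) and h == old_hashes[i]
    print(f"chunk {i}: {'same' if same else 'changed'}")
# Chunks 1 through 4 all report "changed", even though only one line was inserted.
```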
Is there any simple solution for this? My first thought was to hash each chunk and build a "catalog" that records the correct chunk order. That way I could easily check by hash whether a chunk already exists. However, I realized this hash-catalog approach only works for files where the chunk boundaries are predictable and stable. With binary files, for example, you are pretty much limited to fixed-size byte chunks, so if data were added at the beginning and you then split the file into, say, 100 kB chunks, every later chunk would contain different data than before.
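To show that fixed-size failure mode concretely, here is a minimal sketch (Python again, with SHA-256 and 100 kB chunks as arbitrary example choices): prepending just 10 bytes shifts every chunk boundary, so none of the new chunk hashes appear in the old catalog.

```python
import hashlib
import os

CHUNK_SIZE = 100 * 1024  # fixed 100 kB chunks, as in the example above

def fixed_chunks(data: bytes) -> list[str]:
    # Split into fixed-size chunks and hash each one (the "catalog" idea).
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

old = os.urandom(1024 * 1024)   # 1 MiB of example binary data
new = b"\x01" * 10 + old        # the same data with 10 bytes prepended

old_catalog = set(fixed_chunks(old))
new_hashes = fixed_chunks(new)
reused = sum(1 for h in new_hashes if h in old_catalog)
print(f"chunks reused: {reused} of {len(new_hashes)}")
# Prints "chunks reused: 0 of 11": the 10-byte insertion shifts every chunk
# boundary, so no fixed-size chunk hash matches the old catalog.
```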
Any solutions?