I'm developing a backup tool and I can't figure out the most efficient way to do remote backups. I don't want to send the whole file every time there's a small change, so I guess incremental backup is the solution. This is all well and good, but now I'm stuck on the problem of how to split one file into multiple chunks.
Here's the problem. Let's say I have a simple text file where each line is one chunk:
First line
Second line
Third line
Fourth line
Now I have 4 chunks. If I update the second line to, say, "THE second line", I only need to back up the second chunk.
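To make that concrete, here is a minimal sketch of what I mean (Python and SHA-256 are just arbitrary choices for illustration), comparing the chunks position by position:

```python
import hashlib

old = ["First line", "Second line", "Third line", "Fourth line"]
new = ["First line", "THE second line", "Third line", "Fourth line"]

# Compare chunk i of the old file with chunk i of the new file by hash.
for i, (a, b) in enumerate(zip(old, new)):
    a_hash = hashlib.sha256(a.encode()).hexdigest()
    b_hash = hashlib.sha256(b.encode()).hexdigest()
    print(f"chunk {i}: {'changed' if a_hash != b_hash else 'same'}")
# Only chunk 1 (the second line) differs, so only that chunk needs to be sent.
```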
But what if something like this happens:
First line
First and half line
Second line
Third line
Fourth line
Now that I added "First and half line", every line after it is in a different place. So if each line is one chunk, it looks like every chunk after the first has changed, even though the content is the same. A sketch of this failure is below.
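Here is the same position-by-position comparison applied to this case (again a minimal Python sketch, with SHA-256 chosen arbitrarily):

```python
import hashlib

def line_chunks(text: str) -> list[bytes]:
    # Treat each line of the file as one chunk.
    return [line.encode() for line in text.splitlines()]

old = "First line\nSecond line\nThird line\nFourth line\n"
new = "First line\nFirst and half line\nSecond line\nThird line\nFourth line\n"

old_hashes = [hashlib.sha256(c).hexdigest() for c in line_chunks(old)]
new_hashes = [hashlib.sha256(c).hexdigest() for c in line_chunks(new)]

# Compare chunk i of the new file against chunk i of the old file.
for i, h in enumerate(new_hashes):
    same = i < len(old_hashes) and h == old_hashes[i]
    print(f"chunk {i}: {'same' if same else 'changed'}")
# Chunks 1 through 4 all report "changed", even though only one line was inserted.
```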
Is there any simple solution for this? My first thought was to hash each chunk and build a "catalog" that records the correct chunk order. That way I could easily check by hash whether a chunk already exists. However, I realized this hash-catalog approach only works for files where the chunk boundaries are predictable and stable. With binary files, for example, you are pretty much limited to fixed-size byte chunks, so if data were added at the beginning and you then split the file into, say, 100 kB chunks, every later chunk would contain different data than before.
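To show that fixed-size failure mode concretely, here is a minimal sketch (Python again, with SHA-256 and 100 kB chunks as arbitrary example choices): prepending just 10 bytes shifts every chunk boundary, so none of the new chunk hashes appear in the old catalog.

```python
import hashlib
import os

CHUNK_SIZE = 100 * 1024  # fixed 100 kB chunks, as in the example above

def fixed_chunks(data: bytes) -> list[str]:
    # Split into fixed-size chunks and hash each one (the "catalog" idea).
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

old = os.urandom(1024 * 1024)   # 1 MiB of example binary data
new = b"\x01" * 10 + old        # the same data with 10 bytes prepended

old_catalog = set(fixed_chunks(old))
new_hashes = fixed_chunks(new)
reused = sum(1 for h in new_hashes if h in old_catalog)
print(f"chunks reused: {reused} of {len(new_hashes)}")
# Prints "chunks reused: 0 of 11": the 10-byte insertion shifts every chunk
# boundary, so no fixed-size chunk hash matches the old catalog.
```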
Any solutions?