Store multiple versions of large binary file with minimal data duplication (preferably Linux)

Question

I need to store multiple versions of a ~150 GB binary file (qcow2) on Linux servers with local storage. I was hoping for a solution that keeps only diffs that can be merged as needed, so that I don't have to create another copy of a 150 GB file when only 4 GB have changed. This is a storage question, not a question about KVM/qcow2-specific features; I have already explored some of those options. I am currently using CentOS 6.3 with ext4. The files will need to be stored indefinitely and must be completely intact when restored. I am willing to change the filesystem, etc., if a solution is worth it.

Using overlays and just backing up the overlays, keeping the base image read only, internal and external snapshots... – Sep 23 '13 at 18:08
What about using SVN or Git? If it is on a server dedicated for just this purpose and a dedicated repo. – Sep 23 '13 at 18:09
I am considering looking into git-annex or boar to version control the files. Any pertinent info would be cool. – Sep 23 '13 at 18:18

score 2 · Answer 1 · answered Oct 07 '13 at 20:19

ZFS on Linux with deduplication may be your friend in this case. There are Red Hat RPMs/repos available for installation.

Even without dedupe, if you can work this into the ZFS snapshotting workflow, there are some significant advantages to attempting this with ZFS.
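For what it's worth, here is a minimal sketch of how that snapshot workflow could look in shell. The dataset name tank/images, the file path, and the function names are placeholders made up for illustration, not anything from the question. The rsync flags --inplace and --no-whole-file are included on the assumption that only changed blocks should be rewritten: a freshly written full copy would not share blocks with earlier snapshots unless dedup is on.

```shell
# Sketch only: dataset/file names and function names are hypothetical.

# Update the live image in place, then freeze that state under a label.
# usage: zfs_store_version SRC DEST_FILE DATASET LABEL
zfs_store_version() {
    src=$1 dest=$2 dataset=$3 label=$4
    # --inplace + --no-whole-file: rewrite only the blocks that differ,
    # so unchanged blocks stay shared with older snapshots.
    rsync --inplace --no-whole-file "$src" "$dest" &&
        zfs snapshot "$dataset@$label"
}

# Expose one stored version as an independent, writable dataset.
# usage: zfs_open_version DATASET LABEL CLONE_NAME
zfs_open_version() {
    dataset=$1 label=$2 clone=$3
    zfs clone "$dataset@$label" "$clone"
}
```

Something like `zfs_store_version new.qcow2 /tank/images/disk.qcow2 tank/images v2` would store a revision, and `zfs_open_version tank/images v2 tank/images-v2` would give read/write access to it. For read-only access no clone is needed; the snapshot contents should also be visible under /tank/images/.zfs/snapshot/v2/.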

Can you explain a bit more about how you wish to work with these files? Are you seeking point-in-time snapshots, or copying multiple revisions of the same/similar files to the datastore?

ewwhite

Multiple revisions which can be accessed independently of one another – Oct 07 '13 at 20:29
Yes, then ZFS snapshots/clones are what you want, as you'll have read/write access to your intermediate revisions. – ewwhite Oct 07 '13 at 20:43

score 0 · Answer 2 · answered Oct 07 '13 at 20:13

I'd look at LVM snapshots as a solution. Without going into much detail, I'd do this:

1. Create an LVM volume large enough to contain your data.
2. Upload the initial copy of your large binary file to this volume.
3. Create an LVM snapshot.
4. Use rsync with --inplace and --no-whole-file to copy another version of the large file over the existing file, so that only the changed blocks are written to the volume.

At this point you can access the original file by mounting the LVM snapshot, while the latest version of the large file remains available on the origin volume. You can create multiple snapshots this way.
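The steps above can be sketched roughly as follows. The volume group vg0, the logical volume names, the mount points, and the function names are all assumptions for illustration. Note that a classic LVM snapshot needs its own copy-on-write space (the size argument here), and it becomes invalid if that space fills up, so size it for the amount of data you expect to change.

```shell
# Sketch only: vg0, LV names, mount points and function names are hypothetical.

# Freeze the current state as a snapshot, then overwrite the live copy in place.
# usage: lvm_store_version SRC DEST_FILE ORIGIN_LV SNAP_NAME COW_SIZE
lvm_store_version() {
    src=$1 dest=$2 origin=$3 label=$4 cow_size=$5
    # The snapshot keeps the pre-update blocks; cow_size bounds how much
    # of the origin may change before the snapshot is invalidated.
    lvcreate --snapshot --size "$cow_size" --name "$label" "$origin" &&
        rsync --inplace --no-whole-file "$src" "$dest"
}

# Mount a stored version read-only to get at the old file.
# usage: lvm_open_version SNAP_DEVICE MOUNT_POINT
lvm_open_version() {
    snap_dev=$1 mnt=$2
    mkdir -p "$mnt" && mount -o ro "$snap_dev" "$mnt"
}
```

For example, `lvm_store_version new.qcow2 /mnt/images/disk.qcow2 vg0/images images_v1 20G` followed later by `lvm_open_version /dev/vg0/images_v1 /mnt/v1`. Both need root, and write performance on the origin tends to drop as the number of active snapshots grows.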

score 0 · Answer 3 · answered Oct 07 '13 at 21:43

I'm using librsync for this purpose. It is available for CentOS and other RHEL clones in the EPEL repository.

Just use:

rdiff signature new.qcow2 /tmp/new.qcow2.rdiffsig
rdiff delta /tmp/new.qcow2.rdiffsig old.qcow2 new.qcow2--old.qcow2.rdiff
rm /tmp/new.qcow2.rdiffsig
xz new.qcow2--old.qcow2.rdiff

You can then delete old.qcow2. When you need it again, run:

xz -d < new.qcow2--old.qcow2.rdiff.xz > /tmp/new.qcow2--old.qcow2.rdiff
rdiff patch new.qcow2 /tmp/new.qcow2--old.qcow2.rdiff old.qcow2
rm /tmp/new.qcow2--old.qcow2.rdiff

This can be chained: you can create an rdiff from old.qcow2 to evenolder.qcow2, and so on. This is rather slow but very space efficient; I generally never need to delete old backups when using this. There's also the rdiff-backup program, which automates a similar scheme for whole directories.
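If you want to script this, a minimal sketch might wrap the two command sequences above in functions. The function names are made up here and are not part of librsync. Temporary files go into a fresh mktemp -d directory because newer rdiff releases may refuse to overwrite an output file that already exists; set TMPDIR to a filesystem with enough room for the uncompressed delta.

```shell
# Sketch only: function names are hypothetical, not part of librsync.

# Keep NEW in full; store OLD only as a compressed reverse delta against NEW.
# usage: make_reverse_delta NEW OLD DELTA_XZ
make_reverse_delta() {
    new=$1 old=$2 out=$3
    tmp=$(mktemp -d) || return 1
    # Signature of the file we keep, then a delta that turns it back into OLD.
    rdiff signature "$new" "$tmp/sig" &&
        rdiff delta "$tmp/sig" "$old" "$tmp/delta" &&
        xz -c "$tmp/delta" > "$out"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Rebuild OLD from NEW plus the compressed delta.
# usage: restore_from_delta NEW DELTA_XZ RESTORED
restore_from_delta() {
    new=$1 delta_xz=$2 restored=$3
    tmp=$(mktemp -d) || return 1
    xz -dc "$delta_xz" > "$tmp/delta" &&
        rdiff patch "$new" "$tmp/delta" "$restored"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Since the files must be completely intact when restored, it seems prudent to run restore_from_delta into a scratch location and compare the result with old.qcow2 (for example with cmp) before deleting the original.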
