
I'm currently developing a Xen backup system; however, I've run into the following problem:

I have two methods of backing up:

  • do a dd of the LVM snapshot, tar it, and rsync it to a remote location
  • mount the LVM snapshot and rsync everything to a remote location

Now, the second option allows me to use rdiff-backup, so I can keep incremental backups and save a lot of space, whilst the first option is really storage-heavy.

Now, I have two questions:

  • Is there any way to avoid the 'whitespace' when using dd? Let's say I have a 50 GB LVM volume and only 3 GB is used; dd will still create a 50 GB image, so 47 GB is wasted. tar can fix this, but it takes a lot of extra time (which I basically don't have)
  • Can the img files created by dd be saved incrementally in some way?
– Devator

3 Answers


Compression for blank space

Let's take it back to basics, starting from your snapshot. First, I'm going to ask you to look at why you're tarring up a single file. Stop and think for a bit about what tar does and why you're doing that.

$ dd if=/dev/zero of=zero bs=$((1024*1024)) count=2048
2048+0 records in
2048+0 records out
2147483648 bytes transferred in 46.748718 secs (45936739 bytes/sec)
$ time gzip zero

real    1m0.333s
user    0m37.838s
sys     0m1.778s
$ ls -l zero.gz
-rw-r--r--  1 user  group  2084110 Mar 11 16:18 zero.gz

Given that, we can see that the compression gives us about a 1000:1 advantage on otherwise empty space. Compression works regardless of system support for sparse files. There are other algorithms that will tighten it up more, but for raw overall performance, gzip wins.

Unix utilities and sparse files

Given a system with support for sparse files, dd sometimes has an option to save the space. Curiously, my Mac includes a version of dd that has a conv=sparse flag, but the HFS+ filesystem doesn't support it. Conversely, a fresh Debian install I used for testing has support for sparse files in ext4, but that install's dd doesn't have the flag. Go figure.

Thus, another exercise:

I copied /dev/zero into a file the same as above. It took up 2G of space on the filesystem as confirmed by du, df, and ls. Then I used cp on it and found myself with 2 files using 4GB of space. So, it's time to try another flag:

$ cp --sparse=always sparse sparse2

Using that forces cp to take a regular file and use sparse allocation whenever it sees a long string of zeroes. Now I've got 2 files that report as taking up 4GB according to ls, but only 2GB according to du and df.

Now that I've got a sparse file, will cp behave? Yes: cp sparse2 sparse results in ls showing 2GB of consumed space for each file, while du shows them taking up zero blocks on the filesystem. Conclusion: some utilities will respect an already-sparse file, but most will write the entire thing back out. Even cp won't turn a written file back into a sparse one unless you force its hand.

Next, I created a 1MB file and made it sparse, then tried editing it in vim. Despite only entering a few characters, we're back to using the whole thing. A quick search found a similar demonstration: https://unix.stackexchange.com/questions/17572/what-is-the-interaction-of-the-rsync-size-only-and-sparse-options
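A quick way to see whether a file really is sparse is to compare its apparent size with the blocks it actually allocates, for example (reusing the files from above):

# Compare apparent size with allocated blocks; a sparse file allocates far fewer blocks
ls -ls sparse sparse2
du -h --apparent-size sparse sparse2
du -h sparse sparse2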

Sparse conclusions

So, my thoughts given all this (a rough sketch of the commands follows the list):

  • Snapshot with LVM
  • Run zerofree against the snapshot
  • Use rsync -S to copy with sparse files resulting
  • If you can't use rsync, gzip your snapshot when transporting it across the network, then run cp --sparse=always against the decompressed image at the destination to create a sparse copy.
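Roughly, that workflow might look like the following. This is only a sketch: the volume group, names, and sizes are placeholders, and zerofree assumes an ext2/3/4 filesystem inside the volume.

# Take a consistent point-in-time snapshot of the guest's volume
lvcreate --snapshot --size 5G --name backup-snap /dev/vg0/guest-disk

# Zero the unused blocks so they compress away / become sparse (ext2/3/4 only)
zerofree /dev/vg0/backup-snap

# Dump the snapshot to a regular file, seeking over zero blocks so the output stays sparse
dd if=/dev/vg0/backup-snap of=/var/backups/guest-disk.img bs=1M conv=sparse

# Copy it off-host, recreating the sparse regions on the far side
rsync -S /var/backups/guest-disk.img backuphost:/backups/

# Drop the snapshot when finished
lvremove /dev/vg0/backup-snap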

Differential backups

The downside of a differential backup on block devices is that things can move around a bit and generate large, unwieldy diffs. There is some discussion on Stack Overflow (https://stackoverflow.com/questions/4731035/binary-diff-and-patch-utility-for-a-virtual-machine-image) that concluded xdelta was the best tool for this. If you are going to do that, again, try to zero out your empty space first.
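For illustration, using the xdelta3 flavour of that tool (file names here are just placeholders):

# Encode a binary delta between the previous image and the current one
xdelta3 -e -s previous.img current.img current.vcdiff

# Decode: rebuild the current image from the previous image plus the delta
xdelta3 -d -s previous.img current.vcdiff current-restored.img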

– Jeff Ferland

Your two questions...

dd just takes the sectors as an image. There is no way to tell it to skip blank spots; it will create a faithful image of the drive you're duplicating. However, if you redirect the output through a compression utility like zip or 7z, the whitespace should shrink it down for nearly the same effect. It will still take time (dd is still reading through the whitespace), but the storage footprint will be greatly reduced; I have a 100+ GB disk image from VMware that compresses to around 20 GB thanks to the unused space.
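A minimal sketch of that idea (device and file names are placeholders):

# Stream the snapshot through gzip so the runs of zeros compress to almost nothing
dd if=/dev/vg0/backup-snap bs=1M | gzip > /var/backups/guest-disk.img.gz

# Restore by decompressing back onto a volume of at least the original size
gunzip -c /var/backups/guest-disk.img.gz | dd of=/dev/vg0/guest-disk bs=1M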

As for incremental saving, not to my knowledge. How would dd know what has changed and what hasn't? It wasn't really meant for that. Incremental saves would most likely have to be done at the file level, with a utility like rdiff-backup or rsync, plus compression.
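If you do mount the snapshot, an incremental run could look roughly like this (classic rdiff-backup syntax; hostnames and paths are placeholders):

# Mount the snapshot read-only, then push a reverse-incremental backup of its files
mount -o ro /dev/vg0/backup-snap /mnt/snap
rdiff-backup /mnt/snap backuphost::/backups/guest
umount /mnt/snap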

– Bart Silverstrim

tar can't fix the wasted space unless it happens to be full of zeros (it normally won't be). Running a tool to zero the free space, as Jeff suggested, would cause the snapshot to CoW large amounts of data, taking a lot of time and using up a lot of snapshot backing-store space. Is there a reason you don't want to mount the snapshot and rsync or rdiff-backup that? You might also look at dump, which can rapidly back up the snapshot without mounting it (if it is ext[234]) and can do multilevel incremental backups. It can be much faster than tar or rsync for filesystems that have many small files, and it can also do multithreaded compression.
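As a rough sketch, assuming an ext2/3/4 filesystem on the snapshot (device and file names are placeholders):

# Level 0 (full) dump of the snapshot; -u records it in /etc/dumpdates, -z compresses
dump -0 -u -z -f /var/backups/guest-level0.dump /dev/vg0/backup-snap

# Later, a level 1 dump saves only what changed since the last level 0
dump -1 -u -z -f /var/backups/guest-level1.dump /dev/vg0/backup-snap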

– psusi
  • It's indeed all zeros. I cannot mount it, as there are operating systems other than Linux involved (so the filesystem isn't necessarily ext[234] either). – Devator May 16 '12 at 07:00