
Recently I switched from zip to bz2 for compressing nightly database dumps. The command I'm using is tar cj. The old zip files would always differ ever so slightly in size from day to day:

-rw-r--r--  1 mysql mysql 1192139 Aug 20 22:00 mysql_full_export.Fri.zip
-rw-r--r--  1 mysql mysql 1192425 Aug 23 22:00 mysql_full_export.Mon.zip
-rw-r--r--  1 mysql mysql 1192140 Aug 21 22:00 mysql_full_export.Sat.zip
-rw-r--r--  1 mysql mysql 1192145 Aug 22 22:00 mysql_full_export.Sun.zip
-rw-r--r--  1 mysql mysql 1192137 Aug 19 22:00 mysql_full_export.Thu.zip
-rw-r--r--  1 mysql mysql 1192403 Aug 24 22:00 mysql_full_export.Tue.zip
-rw-r--r--  1 mysql mysql 1186645 Aug 25 22:00 mysql_full_export.Wed.zip

Whereas the new bz2 files show identical file sizes over the last week:

-rw-r--r--  1 mysql mysql 972800 Oct  1 22:00 mysql_full_export.Fri.bz2
-rw-r--r--  1 mysql mysql 972800 Oct  4 22:00 mysql_full_export.Mon.bz2
-rw-r--r--  1 mysql mysql 972800 Oct  2 22:00 mysql_full_export.Sat.bz2
-rw-r--r--  1 mysql mysql 972800 Oct  3 22:00 mysql_full_export.Sun.bz2
-rw-r--r--  1 mysql mysql 972800 Oct  7 22:00 mysql_full_export.Thu.bz2
-rw-r--r--  1 mysql mysql 972800 Oct  5 22:00 mysql_full_export.Tue.bz2
-rw-r--r--  1 mysql mysql 972800 Oct  6 22:00 mysql_full_export.Wed.bz2

Is it normal for bz2 to produce identically sized files when the input changes only slightly? This database hardly changes, but it does change a little, as you can see from the zip file sizes.

Follow-up:

The answer marked correct below seems the best explanation. The suggestion to calculate an md5 checksum was also helpful as it confirmed that the files are indeed different:

$ md5sum *.bz2
7bec25e80644645e6b2d5b417bb4627d  mysql_full_export.Fri.bz2
9cca30e7ed4fb536976ef9d8705e0466  mysql_full_export.Mon.bz2
bc9b9cd1e5a5e552811bff80192b1b43  mysql_full_export.Sat.bz2
7ebbed98f7153a6cafe61836d9a6440d  mysql_full_export.Sun.bz2
ad1af98a0ecf90bef1dc1c0b3dedb101  mysql_full_export.Thu.bz2
b399d30e03c200c1ad03bde391e5e682  mysql_full_export.Tue.bz2
b14b4d1bb22ef39b9ebc2f668a2f520d  mysql_full_export.Wed.bz2
nw.

3 Answers

Perhaps there is a bug in the archive script. Compare the files:

cmp mysql_full_export.Wed.bz2 mysql_full_export.Tue.bz2

Then compare the contents of the archives (use diff or cmp).
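
A minimal sketch of that second step, assuming `bunzip2`, `mktemp`, and `cmp` are available; the helper name `same_payload` is made up for illustration. It decompresses two archives to temporary files and compares the results byte for byte. The exit status of `bunzip2` is deliberately not checked, so a warning from it does not abort the comparison.

```shell
#!/bin/sh
# same_payload A B (hypothetical helper):
#   returns 0 when both .bz2 files decompress to identical bytes,
#   1 when the decompressed bytes differ,
#   2 when the comparison could not be attempted.
same_payload() {
    command -v bunzip2 >/dev/null 2>&1 || return 2
    [ -r "$1" ] && [ -r "$2" ] || return 2
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
    # Warnings from bunzip2 are discarded; only its output is compared.
    bunzip2 -c "$1" > "$tmp_a" 2>/dev/null
    bunzip2 -c "$2" > "$tmp_b" 2>/dev/null
    cmp -s "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}

same_payload mysql_full_export.Tue.bz2 mysql_full_export.Wed.bz2
case $? in
    0) echo "decompressed contents are identical" ;;
    1) echo "decompressed contents differ" ;;
    *) echo "could not compare the two files" ;;
esac
```

Temporary files are used instead of process substitution so the sketch also runs in a plain `sh`.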

bindbn

In the directory containing your bz2 files paste this command:

for file in *.bz2; do echo "checksum for ${file%.bz2}: $(bunzip2 -c "$file" | md5sum)"; done

If the checksums all differ then the uncompressed files are different.

ThatGraemeGuy
  • You can get the same result by running md5sum against the compressed files, too (two identical compressed inputs will never expand to different outputs when decompressed with the same algorithm). Uncompressing them first just slows down the checks. Easiest check would just be: `md5sum *.bz2` – Christopher Cashell Oct 08 '10 at 15:30
  • both approaches work well; `md5sum *.bz2` outputs more cleanly – nw. Oct 08 '10 at 17:44
  • Yeah, I have nothing to say for myself..... I guess it was a long week. :-o – ThatGraemeGuy Oct 09 '10 at 12:47

Another thought: the tar file format is always aligned on a 512-byte boundary. Each file in the archive is padded out with NUL characters if it is shorter.
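
The arithmetic is easy to check against the 972800-byte size from the question's listing. The sketch below uses a made-up helper, `is_multiple`. The second check goes beyond what the thread states: it assumes GNU tar's default record size of 10240 bytes (20 blocks of 512).

```shell
#!/bin/sh
# is_multiple SIZE BLOCK (hypothetical helper):
# succeed when SIZE is an exact multiple of BLOCK.
is_multiple() {
    [ $(( $1 % $2 )) -eq 0 ]
}

size=972800   # size of every .bz2 file in the question's listing

# 512 bytes is the tar block size described above.
if is_multiple "$size" 512; then
    echo "$size = $(( size / 512 )) x 512"
fi

# Assumption: 10240 bytes is GNU tar's default record size (20 x 512).
if is_multiple "$size" 10240; then
    echo "$size = $(( size / 10240 )) x 10240"
fi
```

By contrast, `is_multiple 1192139 512` fails for the old zip sizes, which were never block-aligned.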

Granted, tar should run before bzip2, so the output size should still vary (in theory). But perhaps the data is being compressed first and then put into the tar, which would align the result to the 512-byte boundary?
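
One rough way to probe this is to look at the first bytes of each file, since a bzip2 stream begins with the three characters `BZh`. The sketch below is only a heuristic. The helper name `looks_like_bzip2` is made up, and `head -c` is assumed to be available (it is in GNU and BSD userlands, though not required by POSIX).

```shell
#!/bin/sh
# looks_like_bzip2 FILE (hypothetical helper):
# succeed when FILE starts with the bzip2 magic bytes "BZh".
looks_like_bzip2() {
    [ "$(head -c 3 "$1" 2>/dev/null)" = "BZh" ]
}

for f in mysql_full_export.*.bz2; do
    [ -e "$f" ] || continue   # skip the unexpanded pattern when nothing matches
    if looks_like_bzip2 "$f"; then
        echo "$f: starts with a bzip2 header"
    else
        echo "$f: does not start with a bzip2 header"
    fi
done
```

A file that fails this check but carries a .bz2 name would suggest the outer container is something other than bzip2, such as an uncompressed tar.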

miquella
  • sounds plausible... 972800 is exactly 1900 * 512. and `bunzip2` always reports `trailing garbage after EOF ignored`. – nw. Oct 08 '10 at 17:49