
I have a program that takes a file, compresses it using /usr/bin/zip or /bin/gzip or /bin/bzip2, and removes the original if and only if the compress operation completes successfully.

However, this program can be killed (via kill -9), or, in principle, can even crash on its own!

Question: Can I assume that the zipped output file that gets created on disk is always valid, without ever having to decompress it and compare it with the original?

In other words, no matter the point the compress operation gets ungracefully interrupted at, does the fact that the compressed output file exists on disk imply it's valid?

In other words, are the compress operation and the file creation on disk together an atomic transaction?

The main concern is to avoid removing the original file when the compressed file is invalid, without having to undergo the costly decompress-and-compare operation.

Note:

  1. Ignore OS file buffers not flushing to disk due to UPS failure.

  2. Ignore disk/media related failure. This can happen much later anyway, and quite independently of the program's interruption.

Harry

1 Answer


A. Yes, if zip, gzip, or bzip2 complete successfully, you can assume that the resulting compressed file is valid with a high probability. Those programs have been around a loooong time, and I would assert that very nearly all data integrity bugs were worked out of them long ago. You also need to consider the reliability of your hardware in its operating environment.

B. (Your "in other words" seem like entirely different questions.) No. An ungracefully interrupted compress operation will generally leave a partial and invalid compressed file behind.

C. No. The file is created and then written to a chunk at a time. Those operations are certainly not atomic.

You just need to verify that the compression utility completed successfully by virtue of it exiting normally and returning zero as the exit code. Then you do not need to examine the compressed file unless you are super paranoid, perhaps because the data has very high value to you.
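That pattern can be sketched in a few lines of shell (a minimal illustration, assuming `gzip` is on `PATH`; `data.txt` is a stand-in for your input file):

```shell
# Create a stand-in input file for the example.
printf 'hello world\n' > data.txt

# Compress to a separate file so the original is untouched on failure.
if gzip -c data.txt > data.txt.gz; then
    # gzip exited with status 0: the archive was written completely,
    # so it is safe to remove the original.
    rm data.txt
else
    # Compression failed: keep the original, discard the partial archive.
    rm -f data.txt.gz
fi
```

Note that if the script itself is killed mid-`gzip`, a partial `.gz` file may be left behind, but the original is never removed, which is the property that matters here.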

I should note that verifying the compressed data will take a fraction of the time it took to compress it, at least for zip and gzip. Verifying with bzip2 will take about as long as the compression did.
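For illustration, here is a short sketch of that verification, including how a truncated archive (the kind an ungraceful interruption leaves behind) fails the check. The `-t` flag is as documented for gzip; the filenames are made up for the example:

```shell
# Build a small, complete archive.
printf 'some sample data\n' > sample.txt
gzip -c sample.txt > sample.txt.gz

# A complete archive passes the integrity test (exit status 0).
gzip -t sample.txt.gz && echo "intact archive OK"

# Simulate an ungraceful interruption by truncating the archive.
head -c 5 sample.txt.gz > truncated.gz
if ! gzip -t truncated.gz 2>/dev/null; then
    echo "truncated archive detected as invalid"
fi

# bzip2 and unzip offer the same kind of check:
#   bzip2 -t file.bz2
#   unzip -t file.zip
```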

Mark Adler
  • I now feel dumb even asking about **B** - sheesh! **C**, I was aware of anyway. But thanks, regardless. I failed to mention the `-t` (test integrity) option that `unzip`, `gunzip`, and `bunzip2` all provide. However, I'm not sure if it's identical to a transparent but full decompress-and-compare, given that `gunzip` and `unzip` may compare only block-level CRCs (and not each byte of the compressed output against the original). I've forgotten my math, so I'm not sure if a CRC is a 100% guarantee or only a statistical one. (I bet it's the latter, since a part cannot represent the whole, only summarize it.) – Harry Jun 01 '18 at 03:22
  • 1
    Yes, the CRC check done by a `-t` is statistical, with a 2^-32 chance of a false positive. – Mark Adler Jun 01 '18 at 04:59