
I recently backed up my soon-to-expire university home directory by sending it over SSH as a tar stream and compressing it on my end: `ssh user@host "tar cf - my_dir/" | bzip2 > uni_backup.tar.bz2`.

This got me thinking: I only know the basics of how compression works, but I would imagine that compressing a stream of data like this leads to poorer compression, since the algorithm has to finish handling one block of data at some point, write it to the output stream, and then move on to the next block.

Is this the case? Or do these programs simply read a lot of data into memory, compress it, write it out, and repeat? Or are there clever tricks used in these “stream compressors”? I see that both bzip2's and xz's man pages talk about memory usage, and man bzip2 also hints that little is lost by chopping the data to be compressed into blocks:

> Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.

I would still love to hear if other tricks are used, or about where I can read more about this.
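To make the question concrete, here is roughly the chunked, buffered loop I imagine such a tool running. This is only a sketch using Python's stdlib `bz2` module as a stand-in for the real bzip2 binary, and the chunk size is an arbitrary value I picked:

```python
import bz2
import sys

# Hypothetical sketch of a chunked stream compressor: read stdin in
# fixed-size pieces, feed them to an incremental compressor, and write
# whatever compressed output it hands back. CHUNK_SIZE is arbitrary.
CHUNK_SIZE = 256 * 1024

compressor = bz2.BZ2Compressor(9)  # level 9 corresponds to 900k blocks

while True:
    chunk = sys.stdin.buffer.read(CHUNK_SIZE)
    if not chunk:
        break
    # compress() may return b"" until the compressor has gathered a full block
    sys.stdout.buffer.write(compressor.compress(chunk))

# flush() emits whatever remains in the final, possibly partial block
sys.stdout.buffer.write(compressor.flush())
```

Is that roughly what bzip2 itself does with its input, or is there more to it?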

beta
  • Good question; I'd just like to point out that piping a stream of data through a compression program like `bzip2` doesn't necessarily mean that small blocks of data are being compressed and sent along in real time. You could easily have a compression utility that eats all data sent to it until an EOF is reached, and only then compresses it and sends it along. – Rag Aug 22 '11 at 17:51
  • I would not expect bzip2 to analyze several gigs of data before starting to write the first bytes of output. That might save a few bytes in the final size, but we all want performance, too. But yes, I like your question, too. – Christopher Creutzig Aug 22 '11 at 17:57

1 Answer


This question relates more to buffer handling than to compression algorithms, although a bit can be said about those too.

Some compression algorithms are inherently "block based", which means they absolutely need to work with blocks of a specific size. This is the case for bzip2, whose block size is selected with the "level" switch, from 100 kB to 900 kB. So, if you stream data into it, it will wait for a block to be filled and start compressing that block once it is full (for the last block, it simply works with whatever size it receives).
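You can observe that block-buffering behaviour from the outside with a small sketch using Python's stdlib `bz2` bindings (not bzip2's own code): feed the compressor small pieces and watch when output actually appears.

```python
import bz2

# Feed a block-based compressor small pieces: it emits (almost) nothing
# until an internal block fills up, then compresses that whole block.
comp = bz2.BZ2Compressor(1)          # level 1 -> 100k blocks

emitted = 0
for i in range(5):
    out = comp.compress(b"some repetitive text " * 2500)   # ~52 kB per piece
    emitted += len(out)
    print(f"after piece {i + 1}: {emitted} bytes of compressed output so far")

tail = comp.flush()                  # compresses the final, partial block
print(f"flush() produced {len(tail)} more bytes")
```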

Other compression algorithms can handle streams, which means they can continuously compress new data using older data kept in a memory buffer. Algorithms based on a "sliding window" can do this, and zlib is a typical example.
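As an illustration of the difference (again just a Python sketch, using the stdlib `zlib` bindings): you can force the compressed output out after every piece with a sync flush, and later pieces still compress against earlier ones because the sliding window is kept.

```python
import zlib

comp = zlib.compressobj(level=6, wbits=15)   # wbits=15 -> 32 kB sliding window
decomp = zlib.decompressobj(wbits=15)

for piece in (b"hello stream " * 100,
              b"hello stream " * 100,
              b"hello stream " * 100):
    # Z_SYNC_FLUSH pushes the compressed bytes out now, but does NOT
    # reset the window, so repeated text keeps compressing well.
    out = comp.compress(piece) + comp.flush(zlib.Z_SYNC_FLUSH)
    print(f"{len(piece)} bytes in -> {len(out)} bytes out")
    decomp.decompress(out)                   # usable immediately downstream

comp.flush()                                 # Z_FINISH terminates the stream
```

Each sync flush costs a few extra bytes (a stored block and byte alignment), which is essentially the stream-versus-ratio trade-off the question asks about.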

Now, even "sliding window" compressors may nonetheless select to cut input data into blocks, either for easier buffer management, or to develop multi-threading capabilities, such as pigz.

Cyan