
bzip2 (i.e. the program by Julian Seward) lists available block sizes between 100k and 900k:

 $ bzip2 --help
 bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.

 usage: bzip2 [flags and input files in any order]

   -1 .. -9            set block size to 100k .. 900k

This number corresponds to the hundred_k_blocksize value written into the header of a compressed file.
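
For reference, that byte sits right at the start of the stream: a .bz2 file begins with the magic bytes 'B' 'Z' 'h' followed by an ASCII digit '1'..'9'. A minimal C sketch that reads it (the file name is just a placeholder):

    #include <stdio.h>

    int main(void)
    {
        /* "example.bz2" is just a placeholder file name. */
        FILE *f = fopen("example.bz2", "rb");
        unsigned char hdr[4];

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fread(hdr, 1, 4, f) != 4) {
            fprintf(stderr, "file too short\n");
            fclose(f);
            return 1;
        }
        /* A bzip2 stream starts with 'B' 'Z' 'h', then the block-size digit. */
        if (hdr[0] == 'B' && hdr[1] == 'Z' && hdr[2] == 'h' &&
            hdr[3] >= '1' && hdr[3] <= '9')
            printf("block size: %c00k\n", hdr[3]);
        else
            printf("not a bzip2 stream\n");
        fclose(f);
        return 0;
    }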

From the documentation, memory requirements are as follows:

Compression:   400k + ( 8 x block size )

Decompression: 100k + ( 4 x block size ), or
               100k + ( 2.5 x block size )

At the time the original program was written (1996), I imagine 7.6M (400k + 8 * 900k) might have been a hefty amount of memory on a computer, but for today's machines it's nothing.
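
To make the numbers concrete, here is a hedged sketch using the libbz2 one-shot API, where the block size is chosen via the blockSize100k argument (1..9); the memory figure it prints is simply the documented formula, not a measurement:

    /* Build with: cc demo.c -lbz2 */
    #include <stdio.h>
    #include <string.h>
    #include <bzlib.h>

    int main(void)
    {
        char src[] = "some text to compress, repeated enough to be interesting";
        char dst[1024];
        unsigned int dstLen = sizeof dst;
        int block100k = 9;                   /* equivalent of -9, i.e. 900k blocks */

        /* Documented compression footprint: 400k + 8 x block size. */
        long mem_k = 400 + 8L * block100k * 100;
        printf("estimated compression memory: %ldk (~%.1fM)\n",
               mem_k, mem_k / 1000.0);

        int rc = BZ2_bzBuffToBuffCompress(dst, &dstLen,
                                          src, (unsigned int)strlen(src),
                                          block100k,
                                          /*verbosity=*/0, /*workFactor=*/0);
        if (rc == BZ_OK)
            printf("compressed %zu bytes to %u bytes\n", strlen(src), dstLen);
        else
            fprintf(stderr, "BZ2_bzBuffToBuffCompress failed: %d\n", rc);
        return 0;
    }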

My question is two part:

1) Would better compression be achieved with larger block sizes? (Naively I'd assume yes.) Is there any reason not to use larger blocks? How does the CPU time for compression scale with the block size?

2) Practically, are there any forks of the bzip2 code (or alternate implementations) that allow for larger block sizes? Would this require significant revision to the source code?

The file format seems flexible enough to handle this. For example ... since hundred_k_blocksize holds an 8-bit character that indicates the block size, one could extend down the ASCII table to indicate larger block sizes (e.g. ':' = 0x3A => 1000k, ';' = 0x3B => 1100k, '<' = 0x3C => 1200k, ...).
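
As a purely hypothetical illustration of that idea (stock bzip2 would not accept any byte above '9'), the mapping could look like this:

    #include <stdio.h>

    /* Sketch of the extension suggested above: map the header byte to a
     * block size in units of 100k.  '1'..'9' is the real bzip2 encoding;
     * ':' and beyond are a proposed, non-standard extension. */
    static int header_byte_to_block_100k(unsigned char c)
    {
        if (c >= '1' && c <= '9')   /* standard: 100k .. 900k */
            return c - '0';
        if (c >= ':' && c <= '?')   /* hypothetical: ':' = 1000k .. '?' = 1500k */
            return c - '0';
        return -1;                  /* not a block-size byte */
    }

    int main(void)
    {
        const unsigned char bytes[] = { '1', '9', ':', ';', '<' };
        for (size_t i = 0; i < sizeof bytes; i++)
            printf("'%c' -> %d00k\n", bytes[i],
                   header_byte_to_block_100k(bytes[i]));
        return 0;
    }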

saladi

1 Answer


Your intuition that a larger block size should lead to a higher compression ratio is supported by Matt Mahoney's compilation of programs in his large text compression benchmark. For example, the open-source BWT program BBB (http://mattmahoney.net/dc/text.html#1640) shows a ~40% improvement in compression ratio going from a block size of 10^6 to 10^9, while the compression time roughly doubles between those two values.

Now that the "xz" program, which uses an LZ variant (LZMA2) originally described by the 7-Zip author Igor Pavlov, is beginning to overtake bzip2 as the default strategy for compressing source code, it is worth studying the possibility of upping bzip2's block size to see whether it might be a viable alternative. Also, bzip2 avoided arithmetic coding because of patent restrictions, which have since expired. Combined with the possibility of using the fast asymmetric numeral systems developed by Jarek Duda for entropy coding, a modernized bzip2 could very well be competitive with xz in both compression ratio and speed.

Michael Lee