bzip2 (i.e. the program by Julian Seward) lists the available block sizes, between 100k and 900k:
$ bzip2 --help
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
usage: bzip2 [flags and input files in any order]
-1 .. -9 set block size to 100k .. 900k
That digit (1 through 9) corresponds to the hundred_k_blocksize
value written into the header of a compressed file.
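To make that concrete, here's a minimal sketch (not production code) that reads the byte straight out of the header: a bzip2 stream starts with the magic bytes 'B' 'Z' 'h' followed by that single ASCII digit.

#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file.bz2\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char hdr[4];
    /* "BZh" magic, then the hundred_k_blocksize character '1'..'9' */
    if (fread(hdr, 1, 4, f) != 4 ||
        hdr[0] != 'B' || hdr[1] != 'Z' || hdr[2] != 'h' ||
        hdr[3] < '1' || hdr[3] > '9') {
        fprintf(stderr, "not a bzip2 file\n");
        fclose(f);
        return 1;
    }
    printf("block size: %dk\n", (hdr[3] - '0') * 100);
    fclose(f);
    return 0;
}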
From the documentation, memory requirements are as follows:
Compression:   400k + ( 8 x block size )
Decompression: 100k + ( 4 x block size ), or
               100k + ( 2.5 x block size ) with the -s (--small) option
At the time the original program was written (1996), I imagine 7.6M (400k + 8 x 900k) was a hefty amount of memory for a computer, but for today's machines it's nothing.
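For reference, a trivial sketch that just tabulates those documented formulas for all nine levels:

#include <stdio.h>

int main(void)
{
    /* Tabulate the documented memory formulas for -1 .. -9. */
    for (int level = 1; level <= 9; level++) {
        int block = level * 100;  /* block size in kB */
        printf("-%d: compress %5dk, decompress %5dk (or %6.1fk with -s)\n",
               level, 400 + 8 * block, 100 + 4 * block, 100 + 2.5 * block);
    }
    return 0;
}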
My question has two parts:
1) Would better compression be achieved with larger block sizes? (Naively I'd assume yes.) Is there any reason not to use larger blocks? How does the CPU time for compression scale with block size?
2) Practically, are there any forks of the bzip2 code (or alternate implementations) that allow for larger block sizes? Would this require significant revision to the source code?
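For what it's worth on the source-code question, the stock libbzip2 API hard-codes the 1..9 range: BZ2_bzCompressInit() returns BZ_PARAM_ERROR for anything larger, so a fork would need to relax at least that check. A sketch, assuming the libbzip2 headers are installed (compile with -lbz2):

#include <stdio.h>
#include <string.h>
#include <bzlib.h>

int main(void)
{
    bz_stream strm;
    memset(&strm, 0, sizeof(strm));  /* NULL allocators => default malloc/free */

    /* blockSize100k = 10 would mean a 1000k block, but the stock
       library only accepts 1..9 and returns BZ_PARAM_ERROR. */
    int rc = BZ2_bzCompressInit(&strm, 10, 0, 0);
    if (rc == BZ_PARAM_ERROR) {
        printf("blockSize100k = 10 rejected (BZ_PARAM_ERROR)\n");
        return 0;
    }
    BZ2_bzCompressEnd(&strm);
    return 0;
}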
The file format seems flexible enough to handle this. For example, since hundred_k_blocksize holds an 8-bit character that indicates the block size, one could continue down the ASCII table past '9' to indicate larger block sizes (e.g. ':' = 0x3A => 1000k, ';' = 0x3B => 1100k, '<' = 0x3C => 1200k, ...).
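Purely as a hypothetical sketch of that proposal (nothing the current format actually supports), the decoder's mapping from header character to block size would simply keep counting past '9':

#include <stdio.h>

/* Hypothetical: '1'..'9' keep their current meaning (100k..900k),
   and the characters after '9' in the ASCII table continue the series. */
static int block_size_k(unsigned char hundred_k_blocksize)
{
    if (hundred_k_blocksize < '1')
        return -1;  /* invalid header byte */
    return (hundred_k_blocksize - '0') * 100;
}

int main(void)
{
    const unsigned char chars[] = { '9', ':', ';', '<' };
    for (int i = 0; i < 4; i++)
        printf("'%c' (0x%02X) => %dk\n",
               chars[i], chars[i], block_size_k(chars[i]));
    return 0;
}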