Some other questions here have been about decompressing only a part/chunk of a large file of compressed data, i.e. some sort of "random access decompression". Bzip2 has always been among the recommendations for such a feature.

Reading about bzip2 on Wikipedia and in a document referred to as the informal specification, it was not completely clear to me at what level a part of a bzip2 file can be decompressed separately. There seem to be two options: a) at the level of BZipStreams, or b) even at the level of StreamBlocks (of which, to my understanding, there can be one or more inside a BZipStream).

BZipFile:=BZipStream+
└──BZipStream:=StreamHeader StreamBlock* StreamFooter
   ├──StreamHeader:=HeaderMagic Version Level
   ├──StreamBlock:=BlockHeader BlockTrees BlockData
   │  ├──BlockHeader:=BlockMagic BlockCRC Randomized OrigPtr
   │  └──BlockTrees:=SymMap NumTrees NumSels Selectors Trees
   │     ├──SymMap:=MapL1 MapL2{1,16}
   │     ├──Selectors:=Selector{NumSels}
   │     └──Trees:=(BitLen Delta{NumSyms}){NumTrees}
   └──StreamFooter:=FooterMagic StreamCRC Padding

Although bzip2 is often praised for this, the blocks inside each BZipStream are bit-aligned rather than byte-aligned. To me that would suggest that decompressing individual blocks separately was not something the format was designed for, though I cannot be sure, hence this question :)

Update

A look at the bzip2recover man page says:

bzip2 compresses files in blocks, usually 900kbytes long. Each block is handled independently. If a media or transmission error causes a multi-block .bz2 file to become damaged, it may be possible to recover data from the undamaged blocks in the file.

The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.

which might strongly suggest that each block can be decompressed separately. Is this correct?
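To make the question concrete, here is a minimal Python sketch of what I imagine "decompressing one block" would involve. It is only an illustration under several assumptions: the file holds a single BZipStream, the 48-bit magics do not occur by chance inside the compressed data, and a block can be made into a valid stream of its own by re-wrapping it with a stream header and footer (which, as far as I understand, is roughly what bzip2recover does). The function names `find_magics`, `extract_block` and `split_blocks` are made up for this example and are not part of any library; only the standard `bz2` module is used to check the result.

```python
import bz2

BLOCK_MAGIC = 0x314159265359   # 48-bit pattern that starts every block
FOOTER_MAGIC = 0x177245385090  # 48-bit pattern that starts the stream footer
MASK48 = (1 << 48) - 1


def find_magics(data: bytes):
    """Scan bit by bit; return (block_starts, footer_starts) as bit offsets."""
    blocks, footers = [], []
    window = 0
    for i, byte in enumerate(data):
        for bit in range(7, -1, -1):
            window = ((window << 1) | ((byte >> bit) & 1)) & MASK48
            start = i * 8 + (7 - bit) + 1 - 48  # bit offset where the window begins
            if start < 0:
                continue
            if window == BLOCK_MAGIC:
                blocks.append(start)
            elif window == FOOTER_MAGIC:
                footers.append(start)
    return blocks, footers


def extract_block(data: bytes, start: int, end: int) -> bytes:
    """Re-wrap the bits [start, end) as a standalone single-block stream."""
    total_bits = len(data) * 8
    nbits = end - start
    as_int = int.from_bytes(data, "big")
    block = (as_int >> (total_bits - end)) & ((1 << nbits) - 1)
    # The 32-bit block CRC directly follows the 48-bit block magic.
    crc = (block >> (nbits - 80)) & 0xFFFFFFFF
    # Reuse the original 4-byte stream header ("BZh" plus the level digit).
    out = int.from_bytes(data[:4], "big")
    out = (out << nbits) | block
    out = (out << 48) | FOOTER_MAGIC
    # With exactly one block, the stream CRC equals that block's CRC.
    out = (out << 32) | crc
    out_bits = 32 + nbits + 48 + 32
    pad = -out_bits % 8  # zero padding up to the next byte boundary
    return (out << pad).to_bytes((out_bits + pad) // 8, "big")


def split_blocks(data: bytes):
    """Return each block of a single-stream .bz2 file as its own stream."""
    blocks, footers = find_magics(data)
    bounds = blocks + footers[:1]  # the last block ends where the footer starts
    return [extract_block(data, s, e) for s, e in zip(bounds, bounds[1:])]
```

If this works, `bz2.decompress(split_blocks(data)[k])` should yield the uncompressed content of block `k` alone. Note that the scan itself reads the whole file, and nothing here says where block `k` starts in the uncompressed data.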

humanityANDpeace
  • The file format does but the standard bzip2 and libbzip2 do not. There are some tools going by the name of seek-bzip2 that can a) decompress an individual block and b) scan a .bz2 file to generate an index of the blocks, which begin at arbitrary bit offsets. The indexing takes the same time as decompressing but the index can be saved and reused. – hippietrail Jul 22 '21 at 13:37
  • @hippietrail I think you could make this the answer. – RobinP May 09 '22 at 20:35
  • @RobinP I haven't played with this stuff for over a year now but I seem to recall that each block in a bzip2 file is also marked with an ASCII or BCD string of the first n digits of PI and the last block's end is marked with sqrt(PI) in the same format. But there is no info on where this block maps to in the uncompressed file, which requires the decompression code used in seek-bzip2. I added feature requests to a couple of bz2 libs to generate this table while compressing. Don't know if any implemented it. – hippietrail May 10 '22 at 02:14

0 Answers