I am looking at making a file format that interleaves two types of chunks of raw bytes.

One chunk will contain a block of bzip2-compressed data, which has a header containing the usual bzip2 magic number (BZh9).

The second chunk will consist of the other data of interest, which has a header containing a different magic number (TBD).

The two magic numbers would be used for seeking, identifying and processing the two data block types differently.

My question is: Is there a magic number I can pick for the second block type that would be very unlikely (or better, impossible) to find inside a bzip2-compressed block of bytes?

In other words, are there particular bytes that bzip2 excludes or would be probabilistically unlikely to use when compressing, within some statistical threshold, which I could use for a header for another data type in the same file?

One option is that, when I find header bytes for a second block type, I would simply try to process data in the second block type, and if that processing fails, then I assume I am accidentally inside a compressed bzip2 block. But I'd like to know if there is the possibility that there are bytes that would not be found in a bzip2 block, or would not be likely to be found.

Alex Reynolds

1 Answer

No. bzip2-compressed data can contain any sequence of bytes, all with essentially equal probability. The most you can do is define a longer series of bytes as the signature, which reduces the probability that the series accidentally appears in the compressed data. But it still could.

The bzip2 format is self-terminating, so if you're willing to take the time to decode the bzip2 data, you can always find where the next thing is.
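
As a minimal sketch of that idea, assuming the file starts with one bzip2 stream followed directly by the second chunk, Python's standard `bz2.BZ2Decompressor` reports where the stream ends through its `eof` and `unused_data` attributes. The helper name `split_after_bzip2` and the `MYMAGIC!!` signature are hypothetical, chosen only for illustration.

```python
import bz2


def split_after_bzip2(buf: bytes):
    """Decode one bzip2 stream at the start of buf.

    Returns (decoded_bytes, remaining_bytes). The remaining bytes are
    whatever follows the end of the bzip2 stream.
    """
    decomp = bz2.BZ2Decompressor()
    decoded = decomp.decompress(buf)
    if not decomp.eof:
        raise ValueError("bzip2 stream is truncated")
    return decoded, decomp.unused_data


# Hypothetical layout: one bzip2 stream, then a second chunk with its own magic.
payload = b"hello " * 100
second_chunk = b"MYMAGIC!!" + b"other data"
blob = bz2.compress(payload) + second_chunk

decoded, rest = split_after_bzip2(blob)
```

Here `rest` begins exactly at the second chunk, so no signature search is needed. The cost is a full decompression of the bzip2 data.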

To answer the question in a comment: the entire bzip2 stream necessarily terminates on a byte boundary, and the last byte may have 0 to 7 bits of zero padding. You can search backwards from the start of your second stream component for the bzip2 end marker 0x177245385090 (the first 12 decimal digits of the square root of pi), which can start at any bit position within a byte. The marker is 48 bits long and is followed by a 32-bit CRC and then the padding, so it starts 80 to 87 bits back.
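
The backward search could look roughly like the sketch below. It assumes the bits are packed most-significant-bit first and that the layout before the second chunk is end marker (48 bits), CRC (32 bits), then 0 to 7 zero pad bits. The function name `find_end_marker` is hypothetical.

```python
import bz2

END_MARKER = 0x177245385090  # 48-bit bzip2 end-of-stream marker


def find_end_marker(buf: bytes, chunk_start: int):
    """Check whether a bzip2 end marker sits just before chunk_start.

    Returns the number of zero pad bits (0..7) if the marker is found
    80 + pad bits before chunk_start, otherwise None.
    """
    if chunk_start < 11:
        return None
    # The last 11 bytes (88 bits) cover every possible marker position.
    tail = int.from_bytes(buf[chunk_start - 11:chunk_start], "big")
    for pad in range(8):
        if tail & ((1 << pad) - 1):
            continue  # the pad bits must all be zero
        # Skip the pad bits and the 32-bit CRC, then compare 48 bits.
        if (tail >> (pad + 32)) & ((1 << 48) - 1) == END_MARKER:
            return pad
    return None


# Hypothetical usage: a bzip2 stream followed by two bytes of other data.
blob = bz2.compress(b"hello") + b"XX"
pad = find_end_marker(blob, len(blob) - 2)
```

A non-`None` result only says the marker pattern is present at a plausible offset; it can still be a coincidence, just a very improbable one.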

Mark Adler
  • Does it terminate on a byte or on a bit boundary? Is the terminal bit(s) or byte(s) consistent between bzip2 streams? Maybe I can read backwards from my second dataset's header bytes? – Alex Reynolds Nov 08 '16 at 18:56
  • Adler, Are you sure that bzip2 blocks are byte aligned? Seems to me they are not: http://www.forensicswiki.org/wiki/Bzip2 "compressed blocks are bit-aligned and no padding occurs."; http://stackoverflow.com/questions/18262703/bzip2-block-header-1aysy. bzip2recover uses bit scanning for searching blocks; not byte: https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/util/compress/bzip2/bzip2recover.c (main - BLOCK_HEADER_HI, BLOCK_HEADER_LO) – osgx Nov 09 '16 at 00:25
  • My bad. I assumed that the searches for the block patterns for parallel decompression were byte-aligned, but they are not. I just did a search on a bz2 file, and found the signatures at arbitrary bit boundaries. – Mark Adler Nov 09 '16 at 05:55
  • I wouldn't think a bzip2 file is uniformly random, since it encodes patterns. But as a rough approx., a nine-byte header (such as that in my second dataset) would show up in uniformly random data with a probability of 2.1e-22. A bzip2 end marker would have 3.6e-15 chance of showing up. The probability of these two markers being within 10 to 11 bytes of each other by random chance seems even lower, so working backwards may work fine for this, unless my file is very very unlucky. Still, a lot of us can say it's been a rough, unlucky day today. Thanks for your help, I really appreciate it. – Alex Reynolds Nov 09 '16 at 08:11
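
The figures in the last comment can be reproduced with a short calculation, under the same rough assumption that compressed data behaves like uniformly random bits. These are per-position probabilities; the helper `expected_false_hits` is a hypothetical addition that scales them by the number of positions searched.

```python
# Probability that a fixed pattern appears at one given position
# in uniformly random data.
p_header = 2.0 ** -(9 * 8)   # nine-byte header: 72 bits
p_marker = 2.0 ** -48        # bzip2 end marker: 48 bits

print(f"{p_header:.1e}")  # 2.1e-22
print(f"{p_marker:.1e}")  # 3.6e-15


def expected_false_hits(n_positions: int, sig_bits: int) -> float:
    """Expected number of accidental matches over n_positions candidates."""
    return n_positions * 2.0 ** -sig_bits
```

For example, `expected_false_hits(2**40, 72)` estimates the accidental nine-byte matches across roughly a terabyte of byte-aligned positions, which is still far below one.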