
I'm having trouble understanding the Deflate algorithm (RFC 1951).

TL;DR: How do I parse the Deflate compressed block 4be4 0200?

I created a file containing a letter and a newline, `a\n`, and ran `gzip a.txt`. The resulting file a.txt.gz:

1f8b 0808 fe8b eb55 0003 612e 7478 7400

4be4 0200

07a1 eadd 0200 0000

I understand that the first line is the header with additional information, and the last line is the CRC32 plus the size of the input (both belong to the gzip wrapper, RFC 1952, rather than to RFC 1951). These two give me no trouble.
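To make that three-way split concrete, here is a minimal Python sketch. The helper name `split_gzip` and the fixed 16-byte header length are assumptions that fit only this particular dump (10 fixed header bytes plus the zero-terminated name `a.txt`); a general gzip parser has to inspect the flag byte instead.

```python
import struct
import zlib

# The 28 bytes of a.txt.gz from the hex dump above.
GZ = bytes.fromhex(
    "1f8b0808fe8beb550003612e74787400"  # header, ends with the name "a.txt\0"
    "4be40200"                          # raw Deflate data
    "07a1eadd02000000"                  # CRC32 and ISIZE, both little-endian
)


def split_gzip(blob, header_len=16):
    """Split this particular file into header, Deflate data and trailer.

    header_len=16 is an assumption that holds only for this dump:
    10 fixed bytes plus the 6-byte zero-terminated file name.
    """
    header = blob[:header_len]
    deflate = blob[header_len:-8]
    crc32, isize = struct.unpack("<II", blob[-8:])
    return header, deflate, crc32, isize


header, deflate, crc32, isize = split_gzip(GZ)
print(deflate.hex())                        # the block discussed here
print(isize)                                # length of the original input
print(zlib.decompress(deflate, wbits=-15))  # wbits=-15 selects raw Deflate
```

Passing `wbits=-15` tells `zlib` to expect a bare Deflate stream with no zlib or gzip wrapper, which is exactly what the middle line is.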

But how do I interpret the compressed block itself (the middle line)?

Here are its hexadecimal and binary representations:

4be4 0200

0100 1011
1110 0100
0000 0010
0000 0000

As far as I understood, somehow these ones:

Each block of compressed data begins with 3 header bits containing the following data:

  • first bit BFINAL
  • next 2 bits BTYPE

...actually ended up at the end of the first byte: 0100 1011. (I'll skip the question of why anyone would call something a "header" when it actually sits at the tail of something else.)

The RFC contains something that, as far as I understand, is supposed to explain this:

  • Data elements are packed into bytes in order of increasing bit number within the byte, i.e., starting with the least-significant bit of the byte.
  • Data elements other than Huffman codes are packed starting with the least-significant bit of the data element.
  • Huffman codes are packed starting with the most-significant bit of the code.

In other words, if one were to print out the compressed data as a sequence of bytes, starting with the first byte at the right margin and proceeding to the left, with the most-significant bit of each byte on the left as usual, one would be able to parse the result from right to left, with fixed-width elements in the correct MSB-to-LSB order and Huffman codes in bit-reversed order (i.e., with the first bit of the code in the relative LSB position).

But sadly I don't understand that explanation.

Returning to my data. OK, so BFINAL is set, and BTYPE is what? 10 or 01?

How do I interpret the rest of the data in that compressed block?

EugZol
  • Your data is not just the letter 'a'. It is the letter 'a' followed by the new line character, '\n' or decimal 10. So there are two bytes coded in there, not just one. – Mark Adler Sep 06 '15 at 15:11
  • @MarkAdler Thanks for pointing that out. – EugZol Sep 06 '15 at 15:12
  • You can use [infgen](http://zlib.net/infgen.c.gz), a deflate stream disassembler, to help you understand RFC 1951. Both the disassembly of the stream into a readable description and the infgen.c code itself further illustrate the construction of the deflate format. – Mark Adler Sep 06 '15 at 15:14
  • @MarkAdler That program is of great help. You can add link to it as an answer. While formally it doesn't directly answer my question, it is related and very useful. I won't change accepted answer, but I will definitely give you +1. – EugZol Sep 06 '15 at 15:25

1 Answer

First, let's look at the hexadecimal representation of the compressed data as a series of bytes (instead of a series of 16-bit big-endian values as in your question):

4b e4 02 00

Now let's convert those hexadecimal bytes to binary:

01001011 11100100 00000010 00000000

According to the RFC, the bits are packed "starting with the least-significant bit of the byte". The least-significant bit of a byte is the right-most bit of the byte. So the first bit of the first byte is this one:

01001011 11100100 00000010 00000000
       ^
       first bit

The second bit is this one:

01001011 11100100 00000010 00000000
      ^
      second bit

The third bit:

01001011 11100100 00000010 00000000
     ^
     third bit

And so on. Once you've gone through all the bits in the first byte, you then start on the least-significant bit of the second byte. So the ninth bit is this one:

01001011 11100100 00000010 00000000
                ^
                ninth bit

And finally the last bit, the thirty-second bit, is this one:

01001011 11100100 00000010 00000000
                           ^
                           last bit
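The same ordering can be written out mechanically. Here is a minimal Python sketch (the helper name `bits_lsb_first` is made up for illustration) that lists the 32 bits in the order a Deflate decoder consumes them:

```python
def bits_lsb_first(data):
    """Yield the bits of data in Deflate stream order."""
    for byte in data:
        for i in range(8):
            # Bit 0 (the least-significant bit) of each byte comes first.
            yield (byte >> i) & 1


block = bytes.fromhex("4be40200")
stream = "".join(str(b) for b in bits_lsb_first(block))
print(stream)
```

Read left to right, the printed string is the bit stream itself: each byte of the dump appears mirrored, because the dump shows the most-significant bit on the left.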

The BFINAL value is the first bit in the compressed data, and so is contained in the single bit marked "first bit" above. Its value is 1, which indicates that this is the last block in the compressed data.

The BTYPE value is stored in the next two bits of the data. These are the bits marked "second bit" and "third bit" above. The only question is which of the two is the least-significant bit and which is the most-significant bit. According to the RFC, "Data elements other than Huffman codes are packed starting with the least-significant bit of the data element." That means the first of these two bits, the one marked "second bit", is the least-significant bit. So the value of BTYPE is 01 in binary, which indicates that the block is compressed using fixed Huffman codes.
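As a rough illustration, a small bit reader that follows this rule could look like the Python sketch below. The class `BitReader` and its method names are hypothetical, not taken from any library.

```python
class BitReader:
    """Hypothetical helper: reads bits in Deflate order from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # index of the next bit in the stream

    def read_bit(self):
        byte = self.data[self.pos >> 3]      # which byte the bit lives in
        bit = (byte >> (self.pos & 7)) & 1   # LSB of each byte comes first
        self.pos += 1
        return bit

    def read_bits(self, n):
        """Read an n-bit non-Huffman field, least-significant bit first."""
        value = 0
        for i in range(n):
            value |= self.read_bit() << i
        return value


reader = BitReader(bytes.fromhex("4be40200"))
bfinal = reader.read_bits(1)
btype = reader.read_bits(2)
print(bfinal, format(btype, "02b"))
```

Because the first bit read becomes bit 0 of the result, the stream bits 1 then 0 produce the value 01, not 10.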

And that's the easy part done. Decoding the rest of the compressed block is more difficult (and with a more realistic example, much more difficult). Properly explaining how to do that would make this answer far too long (and your question too broad) for this site. I'll give you a hint though: the next three elements in the data are the Huffman codes 10010001 ('a'), 00111010 ('\n') and 0000000 (end of block). The remaining 6 bits are unused and aren't part of the compressed data.
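For the curious, here is a hedged Python sketch that decodes exactly this kind of block. It uses the fixed code ranges from section 3.2.6 of RFC 1951, but it is deliberately incomplete: it handles a single fixed-Huffman block containing only literals and rejects the length/distance symbols a real inflater must support. The function name `inflate_fixed_literals` is made up for this example.

```python
def inflate_fixed_literals(data):
    """Decode one fixed-Huffman block that contains only literals."""
    pos = 0  # index of the next bit in the stream

    def read_bit():
        nonlocal pos
        bit = (data[pos >> 3] >> (pos & 7)) & 1
        pos += 1
        return bit

    def read_code(code, n):
        # Huffman codes arrive most-significant bit first.
        for _ in range(n):
            code = (code << 1) | read_bit()
        return code

    bfinal = read_bit()
    btype = read_bit() | (read_bit() << 1)  # 2-bit field, LSB first
    if btype != 0b01:
        raise ValueError("this sketch handles fixed Huffman blocks only")

    out = bytearray()
    while True:
        code = read_code(0, 7)
        if code <= 0b0010111:                       # 7 bits: symbols 256..279
            symbol = 256 + code
        else:
            code = read_code(code, 1)
            if 0b00110000 <= code <= 0b10111111:    # 8 bits: literals 0..143
                symbol = code - 0b00110000
            elif 0b11000000 <= code <= 0b11000111:  # 8 bits: symbols 280..287
                symbol = 280 + code - 0b11000000
            else:                                   # 9 bits: literals 144..255
                code = read_code(code, 1)
                symbol = 144 + code - 0b110010000
        if symbol == 256:                           # end of block
            break
        if symbol > 256:
            raise ValueError("length/distance pairs are not handled here")
        out.append(symbol)

    unused_bits = 8 * len(data) - pos
    return bytes(out), bfinal, unused_bits


print(inflate_fixed_literals(bytes.fromhex("4be40200")))
```

The function returns the decoded bytes, the BFINAL flag, and the number of padding bits left over in the last byte, so the result can be compared against the hint above.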

Note that to understand how to decode Deflate compressed data, you're going to have to understand what Huffman codes are. The RFC you're following assumes that you do. You should also know how LZ77 compression works, though the document more or less explains what you need to know.

Ross Ridge
  • Thanks a lot! That'll do it. But why are the header bits at the rightmost position? My own simple implementation of BitStream in my programming language will have some trouble plucking them out of there, which could be avoided if they were at the leftmost position. Is it more convenient for some other tools? – EugZol Sep 06 '15 at 09:14
  • @EugZol It depends how you look at things. The header bits start at the first bit of the data. The first bit of the data is the first bit of the first byte. The bits in a byte are ordered with the least-significant bit first and the most-significant bit last. That's the same as how we usually order numbers, least valued number first, most valued number last. The problem is we write the digits of a number backwards, we put the most significant digit of the number first and the least significant digit last. This flips the order of the binary digits if we try to read them from left to right. – Ross Ridge Sep 06 '15 at 14:30
  • @EugZol Regardless of how you look at things, it shouldn't make extracting the bits any more difficult. In C you would do something like `bit = byte & 1; byte >>= 1;`. If the bits were extracted the way you expected, with the left-most and most-significant bit of a byte first, then you would have to do something like `bit = (byte >> 7) & 1; byte <<= 1;` instead. – Ross Ridge Sep 06 '15 at 14:35
  • @EugZol Humans, at least most humans, read numbers backwards. If you think about it, 12 should be the number twenty-one and 21 should be the number twelve. 12 should be twenty-one because first you see a one, and then you should increase the place value with each subsequent digit. Whenever we read numbers we actually jump to the end of the number and read it backwards. – Nick Sotiros May 28 '19 at 13:02
  • @NickSotiros I'm a simple man, I can read forwards and backwards. But when I need to jump back and forth, I inevitably feel frustration, despite my love (though little knowledge) of the German language ("ein-und-dreißig" = "one-and-thirty", and it's 31, not 130 or 131). – EugZol Jun 18 '19 at 15:23
  • @EugZol: The "jumping back and forth" is basically the same problem as endianness. The RFC you quoted in the question explains how to order the bytes and bits so it all makes sense: *In other words, if one were to print out the compressed data as a sequence of bytes, starting with the first byte at the **right** margin and proceeding to the left*. If you use "traditional" hex-dump tools that go from left-to-right but with bit-order within each byte from MSB to LSB, you have a problem very similar to mixed endianness. If you used little-endian bit-order dumps, it would match byte order. – Peter Cordes Oct 06 '20 at 19:23
  • @EugZol: What the RFC is suggesting is basically big-endian byte order, but then read from right to left so it's actually little-endian. So bit and byte order match. In a computer program, there's no "jumping back and forth", that's an artifact of how you dumped. A byte load, or even a little-endian word or halfword load, will get an integer from which you can extract the low bits in order across the whole multi-byte integer. – Peter Cordes Oct 06 '20 at 19:28
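The little-endian load described in the last comment can be tried out in a few lines. This is only a sketch in Python rather than C, with variable names chosen for illustration:

```python
block = bytes.fromhex("4be40200")

# Load all four bytes as one little-endian integer: the first stream bit
# becomes bit 0 of the integer, and byte boundaries disappear.
word = int.from_bytes(block, "little")  # 0x0002e44b

bfinal = word & 1       # first bit of the stream
word >>= 1
btype = word & 0b11     # next two bits, already in LSB-first order
word >>= 2

print(bfinal, btype)
```

After the two shifts, the low bits of `word` hold the first Huffman code in bit-reversed order, which is why decoders typically consume Huffman codes one bit at a time or through a reversed lookup table.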