I need some help understanding how DEFLATE encoding works. I know that it is a combination of the LZSS algorithm and Huffman coding.

So let's encode, for example, "Deflate late". Parameters: [search buffer: 8 KB, look-ahead buffer: 4 KB]. The output of the LZSS algorithm is "Deflate <5, 4>". The next step uses static Huffman coding to reduce the redundancy. Here is my problem: I don't know how I should encode the pair <5, 4> with Huffman.


[Edited]

D 000
f 001
l 010
a 011
t 100
_ 101
e 11

So, according to this table, the string "Deflate " is written as 000 11 001 010 011 100 11 101. As a next step, let's encode the pair (5, 4). According to the book "Data Compression - The Complete Reference", the length 4 maps to the fixed prefix code of symbol 258, followed by the fixed prefix code of the distance 5 (code 4 + 1 extra bit).

That can be summarized as:

length 4 -> 258 -> 0000010
distance 5 -> 4 + 1 extra bit -> 00100|0

So the encoded string is written as [header: 1 01] 000 11 001 010 011 100 11 101 0000010 001000 [end-of-block: 0000000]. BUT if I create a Huffman tree, it is not static Huffman anymore, right?

Good day

FewG
  • Since you didn't ask how to encode the "Deflate", then you must already know how to emit the Huffman codes for those literals. You do exactly the same thing where you emit a length of 4 instead of a literal, followed by a distance code of 5. – Mark Adler Jul 01 '13 at 14:05
  • So, according to this table, the string "Deflate " is written as 000 11 001 010 011 100 11 101. As a next step, let's encode the pair (5, 4). According to the book "Data Compression - The Complete Reference", the length 4 maps to the fixed prefix code of symbol 258, followed by the fixed prefix code of the distance 5. [Summarized as]: length 4 -> 258 -> 0000010 [7 bits], distance 5 -> 4 + 1 extra bit -> 00100|0. So the encoded string is written as [header: 1 01] 000 11 001 010 011 100 11 101 0000010 001000 [end-of-block: 0000000]. BUT if I create a Huffman tree, it is not static Huffman, right? – FewG Jul 01 '13 at 21:56

1 Answer

D 000
f 001
l 010
a 011
t 100
_ 101
e 11

is not the Deflate static code. The static literal/length codes are all 7, 8, or 9 bits, and the distance codes are all 5 bits. You asked about the static codes.

'Deflate late', encoded in the static deflate format as the literals 'Deflate ' followed by a length 4, distance 5 match, is in hex:

73 49 4d cb 49 2c 49 55 00 11 00

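That is broken down as follows (bits are read from the least significant part of each byte first, and each group below is written as it sits in the byte, so the rightmost bit of a group is the one read first):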

011 - 01 means fixed code, 1 means last block
00101110 - D
10101001 - e
01101001 - f
00111001 - l
10001001 - a
00100101 - t
10101001 - e
00001010 - space
0100000 - length 4
00100 - distance 5 or 6 depending on one extra bit
0 - extra bit -> distance 5
0000000 - end code
0 - fill bit to byte boundary
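
As a cross-check, here is a minimal Python sketch that packs the same symbols by hand and should reproduce the bytes above. It is not how zlib does it internally: the `BitWriter` class and the helper names are made up for illustration, and the only real library call is `zlib.decompress` with a negative `wbits` to inflate a raw deflate stream.

```python
import zlib

def fixed_litlen_code(sym):
    """(code, nbits) of a literal/length symbol in the fixed code (RFC 1951, 3.2.6)."""
    if sym <= 143:
        return 0x30 + sym, 8            # literals 0..143
    if sym <= 255:
        return 0x190 + (sym - 144), 9   # literals 144..255
    if sym <= 279:
        return sym - 256, 7             # end-of-block and short lengths
    return 0xC0 + (sym - 280), 8        # symbols 280..287

class BitWriter:
    """Illustrative helper: collects bits in stream order, packs them LSB first."""
    def __init__(self):
        self.bits = []

    def put_value(self, value, nbits):
        # Header fields and extra bits go out least significant bit first.
        for i in range(nbits):
            self.bits.append((value >> i) & 1)

    def put_code(self, code, nbits):
        # Huffman codes go out most significant bit first.
        for i in reversed(range(nbits)):
            self.bits.append((code >> i) & 1)

    def to_bytes(self):
        padded = self.bits + [0] * (-len(self.bits) % 8)  # fill to a byte boundary
        out = bytearray()
        for i in range(0, len(padded), 8):
            byte = 0
            for j, bit in enumerate(padded[i:i + 8]):
                byte |= bit << j        # first bit of the stream lands in bit 0
            out.append(byte)
        return bytes(out)

def encode_example():
    w = BitWriter()
    w.put_value(1, 1)                     # BFINAL = 1, last block
    w.put_value(1, 2)                     # BTYPE = 01, fixed Huffman codes
    for ch in b"Deflate ":
        w.put_code(*fixed_litlen_code(ch))
    w.put_code(*fixed_litlen_code(258))   # length 4 is symbol 258, no extra bits
    w.put_code(4, 5)                      # distance code 4 covers distances 5..6
    w.put_value(0, 1)                     # extra bit 0 selects distance 5
    w.put_code(*fixed_litlen_code(256))   # end of block
    return w.to_bytes()

data = encode_example()
print(" ".join(f"{b:02x}" for b in data))  # 73 49 4d cb 49 2c 49 55 00 11 00
print(zlib.decompress(data, -15))          # b'Deflate late'
```

The two `put_*` methods are the whole trick: everything except the Huffman codes is written least significant bit first, while the Huffman codes are written most significant bit first, which is why the codes look reversed in the breakdown.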
Mark Adler
  • Could you elaborate why 00101110 = D and 10101001 = e? – TLJ Jul 09 '15 at 04:58
  • Read RFC 1951, section 3.2.6. D = 0x44. Add 0x30 to get 0x74. Reverse the bits and get 00101110. e = 0x65. Add 0x30 to get 0x95. Reverse the bits and get 10101001. – Mark Adler Jul 09 '15 at 06:35
  • The RFC is extremely confusing and has half-baked tables in decimal instead of binary, and doesn't clearly explain how the tables were derived. Where did you get 0x30? Is it always + 0x30, or is that based on the value you're adding to? Why are you reversing the bits? Do you know of any other resources (besides the RFC) that provide clear examples of inflating compressed data, and explain why the fixed tables are what they are? – Dan Bechard Dec 21 '16 at 01:17
  • I looked, but could not find any half-baked tables in the [RFC](https://tools.ietf.org/html/rfc1951). The 0x30 comes from the 24 7-bit codes in the fixed Huffman code (symbols 256 to 279). Those precede the 8-bit codes in the canonical construction of the Huffman code, and each 7-bit code takes up two 8-bit code values, so the 8-bit codes start at 2 × 24 = 48 = 0x30. To code literals in the range 0 to 143, you therefore add 0x30 to account for the codes that precede those. The bits are reversed per the convention noted in 3.1.1. – Mark Adler Dec 21 '16 at 04:06
  • To possibly aid in your understanding of RFC 1951, you can look at [puff.c](https://github.com/madler/zlib/blob/master/contrib/puff/puff.c), which is a simple inflator written to unambiguously define the format. – Mark Adler Dec 21 '16 at 04:07
  • The fixed code was defined by Phil Katz when he created the format. Since he is no longer with us, we can only speculate on how he came up with that particular code. – Mark Adler Dec 21 '16 at 04:08
  • thanks for this, helped me a lot with my decoder. i noticed the lower number of each distance code "range" in RFC1951 is very similar to https://oeis.org/A209721 , a(n) = 2*a(n-2)-1 , and the high side might be exactly https://oeis.org/A029744 – don bright Feb 07 '23 at 17:33
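
The origin of the 0x30 offset discussed in the comments above can be checked with a short Python sketch of the canonical code construction from RFC 1951, section 3.2.2, applied to the fixed code lengths from section 3.2.6. The function names `canonical_codes` and `reverse_bits` are illustrative and not taken from any library.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from per-symbol code lengths (RFC 1951, 3.2.2)."""
    max_len = max(lengths)
    # Step 1: count how many symbols use each code length (length 0 means unused).
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1
    # Step 2: compute the smallest code value for each code length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    # Step 3: hand out consecutive codes in symbol order within each length.
    codes = []
    for n in lengths:
        if n:
            codes.append(next_code[n])
            next_code[n] += 1
        else:
            codes.append(None)
    return codes

def reverse_bits(code, nbits):
    """Reverse an nbits-wide code, giving the order in which it sits in the byte."""
    return int(format(code, f"0{nbits}b")[::-1], 2)

# Fixed literal/length code lengths: 0..143 -> 8, 144..255 -> 9, 256..279 -> 7, 280..287 -> 8.
fixed_lengths = [8] * 144 + [9] * 112 + [7] * 24 + [8] * 8
codes = canonical_codes(fixed_lengths)

print(hex(codes[0]))      # 0x30: first 8-bit code, after 24 seven-bit codes (2 * 24 = 48)
print(hex(codes[144]))    # 0x190: first 9-bit code
print(format(reverse_bits(codes[ord("D")], 8), "08b"))  # 00101110
print(format(reverse_bits(codes[ord("e")], 8), "08b"))  # 10101001
```

So the + 0x30 applies only to literals 0 to 143. The other symbol ranges get their own starting values from the same construction.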