
Why does LZ77-based DEFLATE use Huffman encoding for its second pass instead of LZW? Is there something about their combination that is optimal? If so, what is the nature of the output of LZ77 that makes it more suitable for Huffman compression than LZW or some other method entirely?

  • They could have gone for a range coder as the backend (but it's slower and it would be a bit annoying to put those extension bits inside the bitstream), or today probably ANS. – harold Sep 29 '16 at 10:53

2 Answers


Mark Adler could best answer this question.

The details of how LZ77 and Huffman coding work together need some closer examination. Once the raw data has been turned into a string of characters and special length-distance pairs, these elements must be represented with Huffman codes.

Though this is NOT, repeat, NOT standard terminology, call the point where we start reading in bits a "dial tone." After all, in our analogy, the dial tone is where you can start specifying a series of numbers that will end up mapping to a specific phone. So call the very beginning a "dial tone."

At that dial tone, one of three things could follow: a character, a length-distance pair, or the end of the block. Since we must be able to tell which it is, all the possible characters ("literals"), elements that indicate ranges of possible lengths ("lengths"), and a special end-of-block indicator are all merged into a single alphabet. That alphabet then becomes the basis of a Huffman tree. Distances don't need to be included in this alphabet, since they can only appear directly after lengths. Once the literal has been decoded, or the length-distance pair decoded, we are at another "dial tone" point and we start reading again. If we got the end-of-block symbol, of course, we're either at the beginning of another block or at the end of the compressed data.
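To make that loop concrete, here is a toy model of it in C. This is a sketch, not zlib's code: the Huffman stage is replaced by a pre-decoded symbol array so the example runs on its own, and the length and distance values are given directly instead of as base-plus-extra-bits codes.

    /* Toy model of DEFLATE's decode loop (a sketch, not zlib's code). */
    #include <stdio.h>

    enum { END_OF_BLOCK = 256 };   /* symbol 256 ends a block (RFC 1951) */

    int main(void)
    {
        /* Pretend the Huffman decoder already produced these symbols for
           the text "abcabc": three literals, a match meaning "copy 3
           bytes from 3 bytes back", then end-of-block. */
        struct { int sym, len, dist; } in[] = {
            { 'a', 0, 0 }, { 'b', 0, 0 }, { 'c', 0, 0 },
            { 257, 3, 3 },           /* any symbol > 256 is a length here */
            { END_OF_BLOCK, 0, 0 },
        };
        char out[64];
        size_t n = 0;

        for (size_t i = 0; ; i++) {
            int sym = in[i].sym;          /* each pass starts at a "dial tone" */
            if (sym < 256) {
                out[n++] = (char)sym;     /* a literal byte 0..255 */
            } else if (sym == END_OF_BLOCK) {
                break;                    /* end of this block */
            } else {                      /* a length, then a distance */
                for (int k = 0; k < in[i].len; k++, n++)
                    out[n] = out[n - in[i].dist];   /* copy from history */
            }
        }
        out[n] = '\0';
        printf("%s\n", out);              /* prints "abcabc" */
        return 0;
    }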

Length codes or distance codes may actually be a code that represents a base value, followed by extra bits that form an integer to be added to the base value.
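For concreteness, here is how the length codes work. The table values below are taken from RFC 1951, section 3.2.5; the little helper wrapped around them is just a hypothetical sketch, not a zlib function.

    #include <stdio.h>

    /* Length codes 257..285 (RFC 1951, 3.2.5): each code has a base
       length and 0..5 extra bits that are read literally from the
       stream and added to the base. */
    static const unsigned short length_base[29] = {
          3,  4,  5,  6,  7,  8,  9,  10,  11,  13,  15,  17,  19, 23, 27,
         31, 35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258 };
    static const unsigned char length_extra[29] = {
          0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2,
          2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 0 };

    static unsigned length_from_code(int code, unsigned extra_bit_value)
    {
        return length_base[code - 257] + extra_bit_value;
    }

    int main(void)
    {
        /* Code 269 carries 2 extra bits; with extra-bit value 2 it
           decodes to base 19 + 2 = 21. */
        printf("code 269: %u extra bits, length %u\n",
               length_extra[269 - 257],
               length_from_code(269, 2));   /* prints "2 extra bits, length 21" */
        return 0;
    }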

...

Read the whole explanation in Antaeus Feldspar's "An Explanation of the DEFLATE Algorithm" (zlib.net/feldspar.html), from which the above is excerpted.

Long story short: LZ77 provides duplicate-string elimination, and Huffman coding provides bit reduction. This is also covered in the Wikipedia article on DEFLATE.
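To see the duplicate elimination in action, here is a deliberately naive greedy LZ77 tokenizer, again only a sketch: zlib's real matcher uses hash chains and lazy matching rather than this O(n²) scan.

    #include <stdio.h>
    #include <string.h>

    #define MIN_MATCH 3    /* DEFLATE never emits matches shorter than 3 */

    int main(void)
    {
        const char *s = "Blah blah blah blah blah!";
        size_t n = strlen(s), i = 0;

        while (i < n) {
            size_t best_len = 0, best_dist = 0;
            for (size_t j = 0; j < i; j++) {          /* earlier positions */
                size_t len = 0;
                while (i + len < n && s[j + len] == s[i + len])
                    len++;                /* running past i is allowed */
                if (len > best_len) { best_len = len; best_dist = i - j; }
            }
            if (best_len >= MIN_MATCH) {
                printf("match   len=%zu dist=%zu\n", best_len, best_dist);
                i += best_len;
            } else {
                printf("literal '%c'\n", s[i]);
                i++;
            }
        }
        return 0;
    }

On this input it emits six literals, one match of length 18 at distance 5, and a final literal. A match longer than its distance is legal in LZ77: the decoder copies byte by byte and re-reads bytes it has just written, which is how a five-byte pattern expands to eighteen bytes.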

TylerY86

LZW tries to take advantage of repeated strings, just like the first "stage", as you call it, of LZ77. It then does a poor job of entropy coding that information. LZW has been completely supplanted by more modern approaches. (Except for its legacy use in the GIF format.) Once LZ77 generates a list of literals and matches, there is nothing left for LZW to take advantage of, and it would make an almost completely ineffective entropy coder for that information.
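To make the "bit reduction" side concrete, here is a minimal Huffman builder for a made-up, skewed alphabet. It is an illustration, not zlib's implementation (which uses a heap and also caps code lengths): it repeatedly merges the two lightest live nodes, then reads each symbol's code length off as its depth.

    #include <stdio.h>

    #define NSYM 4

    int main(void)
    {
        /* A skewed alphabet, e.g. LZ77 output where one literal dominates. */
        const char *name[NSYM] = { "e", "t", "length-17", "end-of-block" };
        long freq[2 * NSYM] = { 60, 25, 10, 5 };   /* node weights */
        int parent[2 * NSYM] = { 0 };              /* 0 = not merged yet */
        int nodes = NSYM;

        for (int merged = 0; merged < NSYM - 1; merged++) {
            int lo1 = -1, lo2 = -1;          /* two lightest live nodes */
            for (int i = 0; i < nodes; i++) {
                if (parent[i]) continue;     /* already merged away */
                if (lo1 < 0 || freq[i] < freq[lo1]) { lo2 = lo1; lo1 = i; }
                else if (lo2 < 0 || freq[i] < freq[lo2]) { lo2 = i; }
            }
            freq[nodes] = freq[lo1] + freq[lo2];   /* new internal node */
            parent[lo1] = parent[lo2] = nodes;
            nodes++;
        }

        for (int i = 0; i < NSYM; i++) {     /* depth in tree = code length */
            int len = 0;
            for (int p = i; parent[p]; p = parent[p]) len++;
            printf("%-12s freq %2ld -> %d bits\n", name[i], freq[i], len);
        }
        return 0;
    }

With these frequencies the code lengths come out as 1, 2, 3, and 3 bits, so 100 symbols cost 60*1 + 25*2 + 10*3 + 5*3 = 155 bits instead of the 200 bits a fixed two-bit code would need. LZW, by contrast, has no mechanism for giving frequent symbols shorter codes.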

Mark Adler
  • Are there other compression methods that would perform the same role as Huffman encoding, i.e., ones that benefit from high frequencies of particular characters? Is Huffman encoding proved to be optimal? – Gerald Collom Sep 29 '16 at 03:14
  • Yes, there are several others. E.g. arithmetic coding, range coding, and finite state entropy offer better compression at the cost of speed. – Mark Adler Sep 29 '16 at 15:23
  • Why is the speed of compression more valuable than the size of the compressed output? Is the difference in performance really that great? Also, thank you; I will look into these other methods. – Gerald Collom Sep 29 '16 at 17:43
  • As one example, you compress to reduce the amount of data you need to transmit. However, if the time it takes to reduce the data by some amount is more than the time it would have taken to simply transmit that same amount, then it might not be worthwhile to do the compression. – Mark Adler Sep 29 '16 at 17:56
  • But couldn't you be moving/storing compressed data a lot without repeating the compression/decompression? Wouldn't size still be the ultimate decider for storage? There are also two separate contexts that I am considering these questions in, the first being an implementation in an open source library but the second being the information theory side of just theoretical compression and things. Are you answering only within the first context? – Gerald Collom Sep 29 '16 at 18:17
  • Even for storage you can be limited by human patience. If one compression method takes a few minutes and compresses 75%, whereas another method takes a few hours and compresses 83%, you'll probably go for the 75%. At some point, marginal gains are not worth the effort. – Mark Adler Sep 29 '16 at 18:36
  • Mark, thank you so much for sticking around through all my questions. I might just keep asking until you stop... But to your example, there is some combination of speed and compression that would be worth it, then? For example, what if it only took tens of minutes to get an increase of more like 20% compression? What would those numbers have to look like for zlib, for example, to use that algorithm instead? – Gerald Collom Sep 29 '16 at 20:24