
I am making my own DEFLATE compressor, which already beats the zlib library almost every time.

In the DEFLATE format (LZ77), the data stream contains either a byte literal or a back-reference saying that we should copy a byte sequence from the previously decoded bytes. Compressors usually perform LZ77 (finding as many back-references as possible) and then build a Huffman tree and compress that byte/reference stream.
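To make the literal/back-reference idea concrete, here is a toy decoder for such a stream. This is only an illustration of the LZ77 concept, not DEFLATE's actual bit-level format:

```python
def decode(tokens):
    """Decode a toy LZ77 stream: ints are byte literals,
    (distance, length) tuples are back-references."""
    out = bytearray()
    for t in tokens:
        if isinstance(t, int):
            out.append(t)
        else:
            dist, length = t
            for _ in range(length):
                # Copy one byte at a time so overlapping
                # references (dist < length) work, as in LZ77.
                out.append(out[-dist])
    return bytes(out)

print(decode([ord('a'), (1, 5)]))  # b'aaaaaa'
```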

There is an extreme case: a reference to 3 identical bytes, where the length is encoded by 15 bits and the distance by 15+13 bits, while each encoded byte is just 1 bit in our Huffman tree. The difference is 43 vs. 3 bits. Using the reference made the output about 14x larger than encoding the bytes directly.
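The arithmetic behind that worst case can be written out directly. The code lengths below are the hypothetical extremes from the paragraph above, not values from any real tree:

```python
def match_cost_bits(len_code_bits, dist_code_bits, dist_extra_bits):
    """Bits to emit one back-reference:
    length code + distance code + distance extra bits."""
    return len_code_bits + dist_code_bits + dist_extra_bits

def literal_cost_bits(per_literal_bits, n):
    """Bits to emit n literals at a given code length each."""
    return per_literal_bits * n

# Worst case from the question: 15-bit length code,
# 15-bit distance code + 13 extra bits, vs. three 1-bit literals.
ref = match_cost_bits(15, 15, 13)  # 43 bits
lit = literal_cost_bits(1, 3)      # 3 bits
print(ref, lit)                    # 43 3: the reference is ~14x larger
```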

My problem is the following: I have a 10,097,918 B text file. zlib covers 9,089,334 B with references in its LZ77, while my compressor covers 9,305,056 B. So I guess I can say that my LZ77 is better.

But zlib produces a 1,329,309 B file, while my program produces a 1,381,153 B file (both build an optimal Huffman tree). So a better LZ77 leads to worse Huffman encoding. Is there any research on Huffman-friendly LZ77?

Ivan Kuckir

1 Answer


It's a hard problem, since you only know the bit costs associated with these decisions after making all of the literal-vs-back-reference decisions. But you can iterate: use the costs from the previous run to make the decisions while re-encoding, several times, until you're happy with the result.
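That iteration might look like the sketch below. The match finder, the seed costs, and the entropy-based cost model are all illustrative stand-ins, not zlib's or Zopfli's actual code:

```python
import math
from collections import Counter

def estimate_costs(freqs):
    """Approximate bit cost per symbol as -log2(p),
    a stand-in for real Huffman code lengths."""
    total = sum(freqs.values())
    return {s: -math.log2(f / total) for s, f in freqs.items()}

def parse(data, cost):
    """Toy parse: take a 3-byte back-reference when it looks
    cheaper than three literals under the current cost model."""
    out, i = [], 0
    while i < len(data):
        if i + 3 <= len(data):
            j = data.find(data[i:i+3], 0, i)  # toy match finder
            # 20.0 / 8.0 are optimistic seed costs for the first round,
            # before any frequencies exist.
            if j != -1 and cost.get('REF', 20.0) < sum(cost.get(b, 8.0)
                                                       for b in data[i:i+3]):
                out.append(('REF', i - j, 3))
                i += 3
                continue
        out.append(data[i])
        i += 1
    return out

def iterate(data, rounds=3):
    """Re-parse several times, feeding each round's symbol
    frequencies back in as the next round's cost model."""
    cost = {}
    for _ in range(rounds):
        symbols = parse(data, cost)
        freqs = Counter('REF' if isinstance(s, tuple) else s for s in symbols)
        cost = estimate_costs(freqs)
    return symbols
```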

This is one of the things Zopfli does to get better compression ratios while keeping the DEFLATE format, in addition to using a shortest-path approach instead of greedy decisions.
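The shortest-path idea can be sketched as a forward DP over byte positions, where each literal or available match is an edge weighted by its estimated bit cost. Everything here (the cost inputs, the match table) is hypothetical, not Zopfli's real cost model:

```python
def optimal_parse(n, matches, lit_cost, match_cost):
    """
    n: input length; matches[i]: list of match lengths available at i.
    lit_cost[i]: bits for a literal at i; match_cost(i, length): bits
    for a match at i. Returns (total_bits, choices), where each choice
    is 1 (literal) or a match length.
    """
    INF = float('inf')
    best = [INF] * (n + 1)   # best[i] = cheapest bits to encode data[:i]
    step = [0] * (n + 1)     # how we arrived at position i
    best[0] = 0.0
    for i in range(n):
        if best[i] == INF:
            continue
        # literal edge: i -> i+1
        if best[i] + lit_cost[i] < best[i + 1]:
            best[i + 1] = best[i] + lit_cost[i]
            step[i + 1] = 1
        # match edges: i -> i+length
        for length in matches.get(i, []):
            c = best[i] + match_cost(i, length)
            if c < best[i + length]:
                best[i + length] = c
                step[i + length] = length
    # walk back from the end to recover the chosen parse
    choices, i = [], n
    while i > 0:
        choices.append(step[i])
        i -= step[i]
    return best[n], choices[::-1]
```

With 6 bytes at 8 bits per literal and one 20-bit match of length 3 at position 3, the DP picks three literals followed by the match (44 bits) over six literals (48 bits).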

harold
  • Do you know zlib well, too? I have a file that is smaller after zlib compression than after my compression. But my compression covers more bytes with back-references (e.g. I have 2x more 3-byte references than zlib). So I wonder how they do it. I am bad at reading C code. – Ivan Kuckir May 19 '18 at 14:23
  • @IvanKuckir perhaps you did not implement lazy matching? Taking matches too greedily prevents finding longer matches; it would explain having many more tiny matches. – harold May 19 '18 at 14:29
  • I do have lazy matching 2 positions ahead (i.e. if the next position has a longer match, I just write a byte here). It reduces the file size, but it also reduces the number of bytes covered by references ... so weird. – Ivan Kuckir May 19 '18 at 14:35
  • Make sure, too, that you are using the back-reference with the shortest distance from the current position, to help minimize the number of extra bits spent on matches that are long distances away. – DynasticSponge Jun 02 '20 at 21:53