I am making my own DEFLATE compressor, which already beats ZLIB library almost every time.
In DEFLATE format (LZ77), the data stream either contains a Byte literal, or a Back-reference saying, that we should copy a byte sequence from previous decoded bytes. Compressors usually perform LZ77 (find back-references - as many as possible) and then, build a huffman tree and compress that byte/reference stream.
There could be an extreme case: a reference for 3 same bytes, the length is encoded by 15 bits, the distance by 15+13 bits, while the encoded byte is just 1 bit in our huffman tree. The difference is 43 vs. 3 bits. Using the reference made the output 14x larger, than encoding bytes directly.
My problem is following: I have a 10,097,918 B text file. ZLIB covers 9,089,334 B by references in its LZ77, while my compressor covers 9,305,056 B. So I guess I can say that my LZ77 is better.
But ZLIB gives a 1,329,309 B file, while my program gives a 1,381,153 B file (both build an optimal Huffman tree). So a better LZ77 leads to a worse Huffman encoding. Is there any research about Huffman-friendly LZ77?