
I am studying the inner workings of Gzip, and I understand that it uses a combination of Huffman coding and LZ77.

I also realize that a Gzip file is divided into blocks, and that each block has a dictionary built for it. Frequent occurrences of similar data are then replaced by pointers to locations in the dictionary.

So the phrase "horses race other horses" would have the second occurrence of "horses" replaced by a pointer.

However, what if I have an array of 32-bit integers that only stores numbers up to 24 bits? For argument's sake, let's say these 24-bit numbers are very random, hard to compress, and hard to find repetition in.

This would make the first 8 bits of every integer an easy-to-compress string of 0s, but each such string would need a pointer, and each pointer still takes up some amount of data. Even a 1-bit pointer (which I know is smaller than is realistically possible) would still take up 12.5% of the space of the zero byte it replaces.

That would seem somewhat redundant when the array could easily be reduced to a "24-bit" array with basic pattern recognition.

So my questions are:

Does Gzip contain any mechanisms that compress a file better than dictionary pointers can?

How well can Gzip compress small amounts of repetitive data followed by small amounts of hard-to-compress data?

YAHsaves

2 Answers


Each deflate block does not have a "dictionary built for it". What is built for each deflate block is a set of Huffman codes for the literal/length symbols and the distance symbols.

The dictionary you refer to is simply the 32K bytes of uncompressed input that immediately precede the bytes currently being compressed. That's it. Each length/distance pair can refer to a string of 3 to 258 bytes in the last 32K. That is independent of deflate blocks, and such references often go back one or more blocks.
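A quick way to see that window in action is to compare a repeated block that falls inside the 32K window with one that falls outside it. Here is a minimal sketch using Python's zlib module, which implements deflate (the block sizes are arbitrary assumptions):

```python
import os
import zlib

rand = os.urandom(1000)  # 1000 incompressible random bytes

# The second copy starts about 1K back: well inside the 32K window,
# so deflate can encode it as a series of length/distance pairs.
near = rand + b"\x00" * 100 + rand

# The second copy starts about 41K back: outside the 32K window,
# so deflate cannot reference it and must emit literals again.
far = rand + os.urandom(40000) + rand

for name, data in (("near", near), ("far", far)):
    print(name, len(data), "->", len(zlib.compress(data, 9)))
```

The "near" input should compress to roughly half its size, while the "far" input should barely compress at all.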

Deflate will not do well trying to compress a sequence of three random bytes, zero byte, three random bytes, zero byte, and so on. There will be no useful repeated strings, so deflate will only be able to Huffman code the literals, with zeros being more frequent. It would code zeros in two bits, since they occur a little more than 25% of the time, and the rest of the literals in at least 8.25 bits each. For this data that would give an average of about 6.7 bits per byte, or a compression ratio of about 0.85. In fact gzip gives about 0.86 on this data.
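Those figures are easy to check yourself. A minimal sketch using Python's zlib (the number of four-byte groups is an arbitrary assumption):

```python
import os
import zlib

# Three random bytes followed by a zero byte, repeated many times.
data = b"".join(os.urandom(3) + b"\x00" for _ in range(100000))

compressed = zlib.compress(data, 9)
print(len(compressed) / len(data))  # typically prints around 0.85-0.86
```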

If you want to compress that sequence, simply remove the zero bytes! Then you are done at a ratio of 0.75, with no further compression possible.
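For the 32-bit array in the question, removing the zero bytes just means dropping the high byte of each integer. A hypothetical sketch (the names `pack24` and `unpack24` are made up for illustration):

```python
import struct

def pack24(values):
    """Store each integer below 2**24 in three little-endian bytes."""
    out = bytearray()
    for v in values:
        out += struct.pack("<I", v)[:3]  # drop the high byte, which is zero
    return bytes(out)

def unpack24(data):
    """Restore the original list of integers."""
    return [int.from_bytes(data[i:i + 3], "little")
            for i in range(0, len(data), 3)]

values = [0x123456, 0xABCDEF, 42]
assert unpack24(pack24(values)) == values  # lossless, at exactly 0.75
```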

Mark Adler
Thank you for the answer! It really helps me understand gzip better. The reason I wanted to use a 32-bit string is that it's what's naturally used by my CPU; with a 24-bit string you have to do bit-shift operations, and it's more logic to manage. However, I guess I'm stuck doing it that way. Thank you! – YAHsaves May 27 '18 at 21:40

There are many cases where the data appears random but has underlying structure. In such cases, general-purpose compressors will do terribly, but preprocessing can yield amazing results.

For example, if you have a bunch of x,y coordinates, the numbers all seem fairly random, but if they represent the track of a vehicle, then the deltas will be anything but random. So delta encoding, or more complicated curve fitting that subtracts off the known part, can dramatically reduce the size of the storage.
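As a rough illustration, here is a sketch where the "track" is a made-up linear sequence with a small periodic wobble; the raw 32-bit values look varied, but the deltas repeat with period 3 and compress to almost nothing:

```python
import struct
import zlib

def delta_encode(values):
    # Replace each value with its difference from the previous one.
    prev = 0
    deltas = []
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas

# A smooth "track": positions grow steadily with a tiny wobble.
track = [100 * i + (i % 3) for i in range(10000)]

raw_bytes = struct.pack("<10000i", *track)
delta_bytes = struct.pack("<10000i", *delta_encode(track))

print(len(zlib.compress(raw_bytes, 9)))    # raw values: modest compression
print(len(zlib.compress(delta_bytes, 9)))  # deltas: a tiny fraction of that
```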

Dov