1

I am doing a project where I compare different types of text compression methods, such as Huffman and arithmetic coding, in both their static and adaptive forms. I build a probability table for both using the number of occurrences of each letter in the text. For the adaptive form, the receiver does not need the probability table, but for the static form we need to transmit this probability table to the receiver as well, so it can decode the message. Storing the table takes some extra bits, which should be taken into account in the comparison.

So my questions here are:

  1. What is the best solution for storing the probability table (in a file)?
  2. What is the minimum number of bits required to do that? (I know it depends on the text, but is there some way to find the minimum number of bits required to store the table?)

Thank you very much.

user3535581

3 Answers

0

From the probabilities, you assign code lengths to symbols. To reconstruct the code, the receiver needs a list of (code length, symbol count) pairs, followed by the symbols in the order in which code words are to be allocated to them. You can then play around with how you encode those.
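For example, here is a minimal Python sketch of that kind of header (the field layout is purely illustrative, not a fixed format): sort the symbols canonically, record the (code length, symbol count) pairs plus the symbol order, and let the receiver reassign the canonical code words from them.

```python
from collections import Counter

def build_header(lengths):
    # lengths: dict mapping symbol -> code length (from the Huffman stage).
    # Canonical order: by code length, then by symbol value.
    ordered = sorted(lengths, key=lambda s: (lengths[s], s))
    counts = Counter(lengths.values())            # how many codes of each length
    pairs = sorted(counts.items())                # [(code length, symbol count), ...]
    return pairs, ordered

def rebuild_codes(pairs, ordered):
    # Receiver side: assign canonical code words from the pairs + symbol order.
    codes, code, i = {}, 0, 0
    prev_len = pairs[0][0]
    for length, count in pairs:
        code <<= (length - prev_len)              # lengthen the running code word
        prev_len = length
        for _ in range(count):
            codes[ordered[i]] = (code, length)    # symbol -> (code value, code length)
            code += 1
            i += 1
    return codes

# Example: code lengths taken from some Huffman run.
lengths = {'a': 2, 'e': 2, 't': 2, 'o': 3, 'q': 4, 'z': 4}
pairs, ordered = build_header(lengths)
print(pairs)                     # [(2, 3), (3, 1), (4, 2)]
print(rebuild_codes(pairs, ordered))
```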

Encoding the list of symbols can exploit the fact that for every symbol transmitted, the number of bits you need for the following symbols goes down. An option to specify early on that only some subset of the (say) 8-bit symbols is used can help here. As the code words get longer, it may be handy to have an encoding for a run of symbols, rather than transmitting each one individually -- perhaps with a way to express a run less a few symbols, where the "holes" can be expressed in some number of bits that depends on the length of the run -- or a start symbol, a length and a bit vector (noting that the number of bits needed to express the length depends on the start symbol and the number of symbols left, and that there is no need to send a bit for the first and last symbols in the range!).
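As a rough illustration of the "fewer bits for later symbols" point (a toy scheme of my own, not a standard one): each symbol can be sent as an index into the set of symbols not yet transmitted, so the field width shrinks as the list is consumed.

```python
from math import ceil, log2

def bits_needed(n):
    # Bits required to index into a set of n remaining symbols.
    return ceil(log2(n)) if n > 1 else 0

def encode_symbol_list(symbols, alphabet_size=256):
    remaining = list(range(alphabet_size))
    fields = []                               # list of (index, bit width) fields
    for s in symbols:
        idx = remaining.index(s)
        fields.append((idx, bits_needed(len(remaining))))
        remaining.pop(idx)                    # later symbols draw from a smaller set
    return fields

# Sending all 256 byte values in (say) decreasing-frequency order:
fields = encode_symbol_list(list(range(256)))
print(sum(w for _, w in fields), "bits instead of", 8 * 256)   # 1793 vs 2048
```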

The encoding of the Huffman code table is a whole game in itself. For short messages, the table can be a serious overhead... in which case a (small) number of commonly useful predefined tables may give better compression.

You can also mess about with a Huffman encoding for the code length of each symbol, and send those in symbol order. A repeat-count mechanism, with its own Huffman code, can help here, as can a way of skipping runs of unused symbols (i.e. symbols with zero code length). You can, of course, add a first-level table to specify the encoding for this!
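As a sketch of the repeat-count and zero-run idea (loosely modelled on DEFLATE's 16/17/18 codes; the token names below are invented for illustration), the per-symbol code lengths can first be turned into tokens, which are then themselves Huffman coded:

```python
def rle_code_lengths(lengths):
    # lengths: per-symbol code lengths in symbol order (0 = unused symbol).
    tokens, i, n = [], 0, len(lengths)
    while i < n:
        v = lengths[i]
        run = 1
        while i + run < n and lengths[i + run] == v:
            run += 1
        if v == 0 and run >= 3:
            tokens.append(('ZEROS', run))         # skip a run of unused symbols
        elif run >= 4:
            tokens.append(v)
            tokens.append(('REP', run - 1))       # repeat the previous length
        else:
            tokens.extend([v] * run)
        i += run
    return tokens

# Example: a 256-entry table that is mostly zeros.
lengths = [0] * 65 + [3, 3, 3, 4, 4] + [0] * 186
print(rle_code_lengths(lengths))
# [('ZEROS', 65), 3, 3, 3, 4, 4, ('ZEROS', 186)]
```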

Another approach is a number of bit vectors, one vector for each code word length. Starting with the code word length that has the most symbols, emit the length and a bit vector; then the next most populous code length with a smaller bit vector; and so on. Again, a way to encode runs and ranges can cut down the number of bits required, and again, as you proceed, the number of bits required for those goes down.
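A toy Python version of the per-length bit-vector idea (illustrative only): emit the lengths in order of decreasing symbol count, each with a bit vector over the symbols not yet assigned, so each successive vector is shorter than the last.

```python
from collections import Counter

def per_length_bit_vectors(lengths, alphabet_size=256):
    # lengths: dict mapping symbol (byte value) -> code length.
    by_count = sorted(Counter(lengths.values()).items(),
                      key=lambda kv: -kv[1])       # most populous length first
    remaining = list(range(alphabet_size))
    out = []
    for length, _count in by_count:
        vector = [1 if lengths.get(s) == length else 0 for s in remaining]
        out.append((length, vector))
        remaining = [s for s in remaining if lengths.get(s) != length]
    return out                                     # each vector covers fewer symbols

lengths = {97: 3, 101: 2, 111: 3, 116: 2, 122: 5}  # byte value -> code length
for length, vector in per_length_bit_vectors(lengths):
    print("length", length, "bits set:", sum(vector), "vector size:", len(vector))
```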

The question is: how sensitive is the comparison to the size of the code table? Clearly, if it is very sensitive, then investigating what can be done by the application of cunning is important. But the effectiveness of any given scheme is going to depend on how well it fits the "typical" data being compressed.

-1

One common way to convey a 0-order table (that is, one with only single tokens and no look-ahead) is to simply prepend all possible symbols in decreasing frequency order. The probabilities usually don't need to be stored, because the coding only requires the ordered set of symbols and not their actual probabilities.

For a compression scheme encoding 8-bit tokens, and assuming all tokens are at least theoretically possible, this would mean 256 bytes of overhead. For cases in which only a subset of bytes is possible (e.g. messages consisting only of uppercase letters and numbers), the table is, of course, smaller.
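As a concrete illustration of coding from the ordering alone, here is a small Python sketch using Elias gamma codes on the symbol ranks (one example of a universal code; see the comments below for the distinction from Huffman and arithmetic coding):

```python
from collections import Counter

def elias_gamma(n):
    # Elias gamma code for n >= 1: unary length prefix + binary value.
    b = bin(n)[2:]
    return '0' * (len(b) - 1) + b

def encode(text):
    # The only "table" is the symbols in decreasing-frequency order.
    ranking = [s for s, _ in Counter(text).most_common()]
    rank = {s: i + 1 for i, s in enumerate(ranking)}
    body = ''.join(elias_gamma(rank[c]) for c in text)
    return ranking, body

ranking, body = encode("abracadabra")
print(ranking)    # ['a', 'b', 'r', 'c', 'd']
print(body)       # the most frequent symbol 'a' costs 1 bit per occurrence
```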

Edward
  • The coding actually does depend on their probabilities, not merely on their order. See https://stackoverflow.com/questions/8885703/maximum-number-of-different-numbers-huffman-compression/9171567#9171567 for an example. – David Cary Aug 01 '19 at 00:58
  • The question was not solely about Huffman coding, but all kinds of coding and specifically asked about minimization of the prefix table size. Universal coding does not require frequency data to be explicitly represented in the table, exactly as I have described. – Edward Aug 01 '19 at 07:41
  • Yes, [universal codes](https://en.wikipedia.org/wiki/Universal_code_(data_compression)) do not require frequency data, only ranking, which is an advantage they have over the Huffman algorithm and Arithmetic coding mentioned in the original question. – David Cary Aug 03 '19 at 03:02
  • Please explicitly state in your answer that Huffman codes need more information than the possible symbols and their order ( https://stackoverflow.com/questions/8885703/maximum-number-of-different-numbers-huffman-compression/9171567#9171567 ), and explicitly state that Fibonacci and other universal codes require only the possible symbols and their order (sacrificing a little compression in order to reduce overhead), and I'll upvote. – David Cary Aug 04 '19 at 13:18
-1

There are a variety of ways to store the probability information that a Huffman or Arithmetic decompressor needs in order to decode the compressed information into (an exact copy of) the original plaintext.

As Mark Adler mentioned in a related question (Storing table of codes in a compressed file after Huffman compression and building tree for decompression from this table),

You do not need to transmit the probabilities or the tree. All the [Huffman] decoder needs is the number of bits assigned to each symbol, and a canonical way to assign the bit values to each symbol that is agreed to by both the encoder and decoder. See Canonical Huffman Code.

I'm assuming you're using a byte-oriented Huffman code, with each compressed code decoding into one of 256 possible bytes.

Perhaps the simplest method of storing those bit lengths is an exhaustive table of exactly 256 bit lengths, one for each possible byte. So, for example, the entry at index 65 in the table gives the bit length of the letter 'A' (ASCII code 65), which may be anywhere from 1 (when A is extremely common) to perhaps 12 (when A is extremely rare), or 0 (indicating A never occurs in this text). Each length easily fits in 1 byte, so that table is 256 bytes long.
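For illustration, here is a small Python sketch of the receiver side under that format (the helper names are mine, not from any standard library): rebuild the canonical code from the 256 bit lengths and decode a bit string with it.

```python
def codes_from_lengths(lengths):
    # lengths: list of 256 bit lengths, one per byte value (0 = byte never occurs).
    syms = sorted((l, s) for s, l in enumerate(lengths) if l > 0)
    codes, code, prev = {}, 0, syms[0][0]
    for l, s in syms:
        code <<= (l - prev)                  # canonical code assignment
        prev = l
        codes[(code, l)] = s                 # (code value, code length) -> byte
        code += 1
    return codes

def decode(bits, codes):
    out, code, length = [], 0, 0
    for b in bits:
        code = (code << 1) | int(b)
        length += 1
        if (code, length) in codes:
            out.append(codes[(code, length)])
            code, length = 0, 0
    return bytes(out)

lengths = [0] * 256
lengths[ord('A')], lengths[ord('B')], lengths[ord('C')] = 1, 2, 2
codes = codes_from_lengths(lengths)
print(decode("0" + "10" + "11" + "0", codes))   # b'ABCA'
```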

(Almost always the maximum length is 15 bits or less, so usually each length can easily fit into half-a-byte, giving a table that is always exactly 128 bytes long -- but dealing with "pathological" data files that trick the Huffman algorithm into assigning some plaintext bytes a symbol longer than 15 bits can be tricky. Some systems specifically check if the maximum length is more than 15 bits, and artificially change the Huffman tree to force all the lengths to at most 15 bits -- sometimes called constrained-depth Huffman tree or length-limited Huffman coding. Likewise, the JPEG standard limits Huffman code lengths to 16 bits).
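A quick sketch of that half-byte packing (assuming every length has already been limited to at most 15 bits):

```python
def pack_lengths(lengths):
    # 256 lengths, each 0..15, packed two per byte -> always exactly 128 bytes.
    return bytes((lengths[i] << 4) | lengths[i + 1] for i in range(0, 256, 2))

def unpack_lengths(packed):
    out = []
    for b in packed:
        out.extend((b >> 4, b & 0x0F))
    return out

lengths = [0] * 256
lengths[ord('A')], lengths[ord('B')] = 3, 12
packed = pack_lengths(lengths)
print(len(packed), unpack_lengths(packed) == lengths)   # 128 True
```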

More compact (and more difficult to describe) approaches are used to store the 4 Huffman bit-length tables in JPEG images and the many Huffman tables used in a DEFLATE stream, in a variable-length form whose size varies with the particular data -- but all of them first reduce the probability information to be stored down to just the bit lengths of the symbols. (Perhaps you could just use an existing DEFLATE implementation, rather than writing and debugging something from scratch?)

My understanding is that arithmetic coding generally uses higher-precision probability information, at least for the most-frequent symbols, than Huffman codes. Please tell me if you find an efficient way to transmit that information to the receiver.

David Cary