
I want to compress many 32-bit numbers using Huffman compression.

Each number may appear multiple times, and I know that every number will be replaced with some bit sequence:

111 010 110 1010 1000 etc...

Now, the question: How many different numbers can be added to the Huffman tree before the length of the binary sequence exceeds 32 bits?

The rule for generating the sequences (for those who don't know) is that every time a new number is added, you must assign it the smallest binary sequence possible that is not a prefix of another.

XCS
  • In theory, a maximum of 2^32 sequences can be added to a tree of height 32. In this case it would represent all possible 32-bit numbers occurring with the same frequency, and the Huffman code generated for each number would be 32 bits. – Rajendran T Jan 16 '12 at 20:24
  • Yes, I have read the basics of Huffman; I have also implemented it. – XCS Jan 19 '12 at 17:09

2 Answers


Huffman is about compression, and compression requires a "skewed" distribution to work (assuming we are talking about normal, order-0 entropy).

The worst situation for Huffman tree depth is when the algorithm creates a degenerate tree, i.e. one with only one leaf per level. This situation can happen if the distribution looks like a Fibonacci series.

Therefore, the worst distribution sequence looks like this: 1, 1, 1, 2, 3, 5, 8, 13, ...

In this case, you fill the full 32-bit tree with only 33 different elements.

Note, however, that to reach a depth of 32 bits with only 33 elements, the most numerous element must appear 3,524,578 times.

Therefore, since summing all the Fibonacci numbers gets you 5,702,886, you need to compress at least 5,702,887 numbers before you run any risk of not being able to represent them with a 32-bit Huffman tree.
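To see this concretely, here is a minimal Python sketch of the textbook Huffman construction (a heap of weights, repeatedly merging the two smallest) run on such a Fibonacci-like distribution. One assumption to flag: reaching the full depth of 32 on these weights depends on tie-breaking, so the sketch breaks ties in favour of the freshly merged node, which is the worst case.

```python
import heapq
import itertools

def huffman_code_lengths(weights):
    """Textbook Huffman construction; returns the code length assigned to each weight.

    Ties are broken in favour of the most recently merged node -- the
    tie-breaking that yields the deepest possible tree on these weights.
    """
    tick = itertools.count()
    # Heap entries: (weight, tie_breaker, indices of the symbols in this subtree).
    heap = [(w, -next(tick), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    depth = [0] * len(weights)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:               # every symbol in the merged subtree sinks one level
            depth[i] += 1
        heapq.heappush(heap, (w1 + w2, -next(tick), s1 + s2))
    return depth

# The worst-case distribution above: 1, 1, 1, 2, 3, 5, 8, 13, ... (33 weights in total).
weights = [1, 1, 1]
while len(weights) < 33:
    weights.append(weights[-1] + weights[-2])

print(max(huffman_code_lengths(weights)))   # -> 32: the degenerate tree is 32 levels deep
```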

That being said, using a Huffman tree to represent 32-bit numbers requires a considerable amount of memory to calculate and maintain the tree.

[Edit] A simpler format, called "logarithm approximation", gives almost the same weight to all symbols. In this case, only the total number of symbols is required.

It is very fast to compute: say, for 300 symbols, some will use 8 bits and others 9 bits. The formula to decide how many of each type:

9 bits: (300-256)*2 = 44*2 = 88; 8 bits: 300 - 88 = 212

Then you can distribute the numbers as you wish (preferably the most frequent ones using 8 bits, but that's not important).

This version scales up to 32 bits, meaning basically no restriction.
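A quick sketch of that computation, assuming the split rule above (this is effectively truncated binary encoding, as noted in the comments below):

```python
def logarithm_approximation_split(n):
    """Split n symbols between floor(log2(n))-bit and (floor(log2(n))+1)-bit codes.

    Only n is needed -- no frequency table.
    """
    k = n.bit_length() - 1      # floor(log2(n)), so 2**k <= n < 2**(k+1)
    n_long = (n - 2**k) * 2     # symbols that need k+1 bits
    n_short = n - n_long        # symbols that fit in k bits
    # The code exactly fills a prefix tree: n_short/2**k + n_long/2**(k+1) == 1.
    assert n_short * 2 + n_long == 2**(k + 1)
    return n_short, n_long

print(logarithm_approximation_split(300))   # -> (212, 88): 212 codes of 8 bits, 88 of 9 bits
```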

Cyan
  • How does the number of appearances of one element influence the depth of the Huffman tree? Every element gets associated with a single binary sequence, so it appears only once in the tree no matter how many times it shows up in the number list. – XCS Jan 17 '12 at 20:07
  • Then it's no longer Huffman. The whole point of Huffman is to give fewer bits to the most frequent symbols, in order to achieve the best compression. In that sense, a Huffman tree is optimal. – Cyan Jan 17 '12 at 20:44
  • Yes, fewer bits for the most frequent symbols, but it doesn't matter whether it appears only 10 times or 3,524,578 times; it will still get replaced by the same bit sequence. – XCS Jan 17 '12 at 21:09
  • It does. Even if a symbol has the "most occurrences" property, it is of utmost importance to know its share of the total occurrences. A symbol which appears 50% of the time is worth representing with a single bit, while if it appears only 0.4% of the total, it should be represented by 8 bits. In both cases, it can be the "most occurring" symbol. – Cyan Jan 18 '12 at 09:00
  • Well, that's what Huffman does: it orders elements by number of appearances, then replaces them with bit sequences. Not all elements will be replaced with bit sequences of the same length. I don't want to know anything about the number of occurrences of an element; I simply want to know how many elements can be replaced with Huffman bit sequences before a sequence of 32 bits is required to store the next element. – XCS Jan 19 '12 at 14:40
  • You seem to be confusing the Huffman construction algorithm with some other kind of binary tree construction algorithm, probably a dynamic one. Building a Huffman tree absolutely requires the number of occurrences, period. Doing it differently is possible; however, that will give you something else, which may try to "mimic" Huffman behavior, but which is not Huffman. – Cyan Jan 19 '12 at 20:06
  • If you have 300 different numbers, the Huffman tree will contain 300 different bit sequences (one associated with each number). Those bit sequences are the same for any 300 different numbers (the shortest 300 bit sequences possible as prefix-tree codes). Now, having these 300 sequences, we must associate them in such a way that the number with the highest frequency of appearance gets the shortest bit sequence. So the bit sequences are the same for any 300 different numbers; the only thing that differs, based on the number of appearances of each number, is which sequence is associated with which number. – XCS Jan 22 '12 at 15:12
  • "Those bits sequences are the same for any 300 different numbers" ==> unfortunately, no, that's the flaw. The bit sequences will be optimized, and therefore different, depending on occurrence distribution. Otherwise this is not Huffman. Maybe you are looking for something simpler, such as the "logarithm approximation". In this case, you just need the number of symbols. For 300 symbols, you would have (300-256)*2 = 88 symbols using 9 bits, while the remaining 300-88=212 symbols would use 8 bits. This is extremely fast to compute. And regarding your question, it can scale up to 32 bits. – Cyan Jan 23 '12 at 11:08
  • Do you have any references for "logarithm approximation"? It sounds very similar to (perhaps effectively the same as) [truncated binary encoding](https://en.wikipedia.org/wiki/truncated_binary_encoding). – David Cary Dec 10 '20 at 15:19
  • Yes, this is effectively the same thing. – Cyan Dec 10 '20 at 18:43

You seem to understand the principle of prefix codes.

Many people (confusingly) refer to all prefix codes as "Huffman codes".

There are many other kinds of prefix codes -- none of them compress data into any fewer bits than Huffman compression (if we neglect the overhead of transmitting the frequency table), but many of them get pretty close (with some kinds of data) and have other advantages, such as running much faster or guaranteeing some maximum code length ("length-limited prefix codes").

If you have large numbers of unique symbols, the overhead of the Huffman frequency table becomes large -- perhaps some other prefix code can give better net compression.

Many people doing compression and decompression in hardware have fixed limits for the maximum codeword size -- many image and video compression algorithms specify a "length-limited Huffman code".

The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol to the shortest bit sequence, the next-most-frequent input symbol to the next-shortest bit sequence, and so on.

For example, some compression programs use Fibonacci codes (a kind of universal code), and always associate the most-frequent symbol to the bit sequence "11", the next-most-frequent symbol to the bit sequence "011", the next to "0011", the next to "1011", and so on.
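For illustration, here is a minimal Python sketch of a Fibonacci encoder along those lines: a greedy Zeckendorf decomposition, written least-significant Fibonacci first, plus a terminating '1'.

```python
def fibonacci_code(n):
    """Fibonacci code for a positive integer n, as a bit string.

    Every codeword ends in the only '11' it contains, which marks its
    end and is what makes the code prefix-free.
    """
    fibs = [1]                  # the distinct Fibonacci values 1, 2, 3, 5, 8, ...
    nxt = 2
    while nxt <= n:
        fibs.append(nxt)
        nxt = fibs[-1] + fibs[-2]
    bits = []
    for f in reversed(fibs):    # greedy: take each Fibonacci number that still fits
        if f <= n:
            bits.append('1')
            n -= f
        else:
            bits.append('0')
    return ''.join(reversed(bits)) + '1'

for rank in range(1, 6):        # most-frequent symbol gets rank 1, and so on
    print(rank, fibonacci_code(rank))
# 1 '11', 2 '011', 3 '0011', 4 '1011', 5 '00011' -- matching the sequence above
```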

The Huffman algorithm produces a code that is similar in many ways to a universal code -- both are prefix codes. But, as Cyan points out, the Huffman algorithm is slightly different from those universal codes. If you have 5 different symbols, the Huffman tree will contain 5 different bit sequences -- however, the exact bit sequences generated by the Huffman algorithm depend on the exact frequencies. One document may have symbol counts of { 10, 10, 20, 40, 80 }, leading to Huffman bit sequences { 0000 0001 001 01 1 }. Another document may have symbol counts of { 40, 40, 79, 79, 80 }, leading to Huffman bit sequences { 000 001 01 10 11 }. Even though both situations have exactly 5 unique symbols, the actual Huffman code for the most-frequent symbol is very different in these two compressed documents -- the Huffman code "1" in one document, the Huffman code "11" in another document. If, however, you compressed those documents with the Fibonacci code, the Fibonacci code for the most-frequent symbol is always the same -- "11" in every document.
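A small sketch reproducing those two cases; the exact bits depend on how the construction breaks ties, but the code lengths match the ones above.

```python
import heapq
import itertools

def huffman_codes(counts):
    """Textbook Huffman: returns {symbol_index: bit string}."""
    tick = itertools.count()
    # Heap entries: (count, tie_breaker, tree); a tree is a symbol index or a (left, right) pair.
    heap = [(c, next(tick), i) for i, c in enumerate(counts)]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tick), (left, right)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            codes[tree] = prefix
    walk(heap[0][2], '')
    return codes

print(huffman_codes([10, 10, 20, 40, 80]))  # lengths 4, 4, 3, 2, 1 -- one symbol gets a single bit
print(huffman_codes([40, 40, 79, 79, 80]))  # lengths 3, 3, 2, 2, 2 -- every code is 2 or 3 bits
```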

For Fibonacci in particular, the first 33-bit Fibonacci code is "31 zero bits followed by 2 one bits", representing the value F(33) = 3,524,578. And so 3,524,577 unique symbols can be represented by Fibonacci codes of 32 bits or less.

One of the more counter-intuitive features of prefix codes is that some symbols (the rare symbols) are "compressed" into much longer bit sequences. If you actually have 2^32 unique symbols (all possible 32 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 32 bits or less. If you actually have 2^8 unique symbols (all possible 8 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 8 bits or less. By allowing the compressor to expand rare values -- to use more than 8 bits to store a rare symbol that we know can be stored in 8 bits -- or use more than 32 bits to store a rare symbol that we know can be stored in 32 bits -- that frees up the compressor to use less than 8 bits -- or less than 32 bits -- to store the more-frequent symbols.

In particular, if I use Fibonacci codes to compress a table of values, where the values include all possible 32-bit numbers, one must use Fibonacci codes up to N bits long, where F(N) is the largest Fibonacci number not exceeding 2^32 -- solving for N, I get N = 47 bits for the least-frequently-used 32-bit symbol.
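Both constants are easy to check with a few lines of Python (a quick sketch, using the same indexing as above, F(1) = F(2) = 1):

```python
def fib(n):
    """Fibonacci numbers with the indexing F(1) = F(2) = 1."""
    a, b = 1, 1
    for _ in range(n - 2):
        a, b = b, a + b
    return b

print(fib(33))              # 3524578 -- the smallest value whose Fibonacci code needs 33 bits

n = 2
while fib(n + 1) <= 2**32:  # find the largest N with F(N) <= 2**32
    n += 1
print(n)                    # 47 -- code length for the rarest of the 32-bit symbols
```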

David Cary