Storing and reconstruction of Huffman tree

Question

What is the best way to dehydrate a Huffman tree? By dehydration I mean: given a Huffman tree and the characters in each leaf, how can you efficiently store the structure of this tree and later reconstruct it?

Take the tree below:

        garbage
        /     \
       A     garbage
             /     \
            B       C

One idea might be to store each symbol together with its level in the tree and then use this information to reconstruct the tree. In this case: A1B2C2. So how can I first get the levels, and then associate each level with its character?

What have you tried? Where are you getting stuck? Building Huffman Trees is very cheap computationally, so there's little point in storing in a special format. Just store the sorted list that you use to construct the tree in the first place. — us2012, Feb 26 '13 at 04:25
I use this to do compression, so enough information needs to be stored in the compressed file to reconstruct the tree later and uncompress. If I use the sorted list it will take 256*4 bytes, but I am looking for a way to save more space in my compressed file. One idea is to traverse the tree and associate each symbol with its level in the tree. I am working on this idea but I am getting stuck on the implementation. — EasyQuestions, Feb 26 '13 at 04:38
Or, to reframe the question: how can I traverse a tree, store the level, and associate each leaf symbol with the level it sits at? — EasyQuestions, Feb 26 '13 at 05:00

score 5 · Accepted Answer · edited May 23 '17 at 12:18

You almost certainly do not need to store the tree itself. You could do, and it shouldn't take the space you think it does, but it's not generally necessary.

If your Huffman codes are canonical, you need only store the bit-length for each symbol, as this is all the information required to generate a canonical coding. This is a relatively small number of bits per symbol, so it should be fairly compact. You can also compress that information further (see the answer from Aki Suihkonen).

Naturally the bit-length of a code is essentially the same as the tree depth, so I think this is roughly what you're asking about. The important part is to know how to build a canonical code, given the lengths - it's not necessarily the same as the codes produced by traversing the tree. You could regenerate a tree from this, but it's not necessarily the tree you started with - however typically you don't need the tree other than to determine the code lengths in the first place.

The algorithm for generating canonical codes is fairly simple:

1. Take all the symbols you want to generate codes for, sorted first by code-length (shortest first), and then by the symbol itself.
2. Start with a zero-length code.
3. If the next symbol requires more bits than are currently in the code, add zeros to the right (least significant bits) of your code until it's the right length.
4. Associate the code with the current symbol, and increment the code.
5. Loop back to (3) until you have generated all the symbols.
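
The steps above could be sketched in C roughly as follows. This is only a minimal illustration: the CodeEntry type and the function names are made up for this example, and it assumes every code length is between 1 and 31 and that the lengths come from a valid prefix code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: one entry per symbol that actually occurs. */
typedef struct {
    unsigned char symbol;  /* the byte value */
    unsigned char length;  /* code length in bits (the leaf's depth in the tree) */
    uint32_t code;         /* filled in by assign_canonical_codes() */
} CodeEntry;

/* Order by code length (shortest first), then by symbol value. */
static int compare_entries(const void *a, const void *b)
{
    const CodeEntry *x = a, *y = b;
    if (x->length != y->length)
        return (int)x->length - (int)y->length;
    return (int)x->symbol - (int)y->symbol;
}

/* Sorts the entries and assigns canonical codes in place.
   Assumption: lengths are in the range 1..31 and form a valid prefix code. */
void assign_canonical_codes(CodeEntry *entries, size_t count)
{
    uint32_t code = 0;
    unsigned current_length = 0;

    qsort(entries, count, sizeof entries[0], compare_entries);
    for (size_t i = 0; i < count; i++) {
        /* Pad with zeros on the right until the code has the required length. */
        code <<= entries[i].length - current_length;
        current_length = entries[i].length;
        /* Associate the code with this symbol, then increment the code. */
        entries[i].code = code++;
    }
}
```

Each code is stored right-aligned in the code field, so only the lowest length bits are meaningful. Because the decoder can run the same routine on the stored lengths, nothing else needs to be written to the file.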

Take the string "banana". Obviously there are 3 symbols used, 'b', 'a', and 'n', with counts of 1, 3, and 2, respectively.

So the tree might look like this:

    *
   / \
  *   a
 / \
b   n

Naively, that could give codes:

a = 1
b = 00
n = 01

However if instead you simply use the bit-lengths as input to canonical code generation, you would produce this:

a = 0
b = 10
n = 11

It's a different code, but it produces compressed output of exactly the same length, because every symbol keeps its code length. Furthermore, you only need to store the code-lengths in order to reproduce the code.

So you only need to store a sequence:

0... 1 2 0... 2 0...

Where "..." represents easily compressible repetition, and the values will all be quite small (probably only 4 bits each; note that the symbols themselves aren't stored at all, since each length's position in the sequence identifies its symbol). This representation will be very compact.

If you really must store the tree itself, one technique is to traverse the tree and store a single bit to indicate whether a node is internal or a leaf, and then, for leaf nodes, store the symbol. This is fairly compact for trees which do not contain every symbol, and not too bad even for fairly complete trees. The worst-case size for this would be the total size of all your symbols, plus one bit for every node the tree could have. For a standard 8-bit byte stream, that would be 320 bytes (256 bytes for the symbols, plus 511 bits for the tree structure itself, rounded up to 64 bytes).

The method is to start at the root node, and for each node:

If the node is a parent, output a 0 and then output the left then right children.
If the node is a leaf, output a 1 and then output the symbol.

To reconstruct, perform a similar recursive procedure, but obviously reading the data and choosing whether to recursively create children, or read in a symbol, as appropriate.
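
As a rough sketch of both directions in C: the Node and BitStream types and all function names below are hypothetical, symbols are assumed to be 8-bit bytes, the output buffer is assumed to be zero-initialised and large enough, and error handling is kept to a minimum.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type: a leaf has both children set to NULL. */
typedef struct Node {
    struct Node *left, *right;
    unsigned char symbol;      /* only meaningful for leaves */
} Node;

/* Minimal MSB-first bit cursor over a caller-supplied buffer. */
typedef struct {
    unsigned char *data;
    size_t bit_pos;
} BitStream;

/* Appends the lowest `count` bits of value; the buffer must start out zeroed. */
static void put_bits(BitStream *s, unsigned value, int count)
{
    for (int i = count - 1; i >= 0; i--, s->bit_pos++)
        if ((value >> i) & 1u)
            s->data[s->bit_pos / 8] |= (unsigned char)(0x80u >> (s->bit_pos % 8));
}

/* Reads `count` bits and returns them right-aligned. */
static unsigned get_bits(BitStream *s, int count)
{
    unsigned value = 0;
    for (int i = 0; i < count; i++, s->bit_pos++)
        value = (value << 1) | ((s->data[s->bit_pos / 8] >> (7 - s->bit_pos % 8)) & 1u);
    return value;
}

/* Pre-order walk: 0 for an internal node, 1 plus 8 symbol bits for a leaf. */
void write_tree(const Node *node, BitStream *out)
{
    if (node->left == NULL) {              /* leaf */
        put_bits(out, 1, 1);
        put_bits(out, node->symbol, 8);
    } else {                               /* internal node */
        put_bits(out, 0, 1);
        write_tree(node->left, out);
        write_tree(node->right, out);
    }
}

/* Mirror image of write_tree(). Returns NULL only if the first allocation
   fails; truncated or malformed input is not detected in this sketch.
   The caller is responsible for freeing the nodes. */
Node *read_tree(BitStream *in)
{
    Node *node = calloc(1, sizeof *node);
    if (node == NULL)
        return NULL;
    if (get_bits(in, 1) == 1) {
        node->symbol = (unsigned char)get_bits(in, 8);
    } else {
        node->left = read_tree(in);
        node->right = read_tree(in);
    }
    return node;
}
```

Because the reader consumes bits in exactly the order the writer produced them, no explicit length or terminator is needed: the recursion stops by itself once every internal node has received two children.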

For the example above, the bit-stream for the tree would be something like:

0, 0, 1, 'b', 1, 'n', 1, 'a'

That's 5 bits for the tree, plus 3 bytes for the symbols, rounding up to 4 bytes of storage. However it will grow rapidly as you add more symbols, whereas storing the code-lengths does not.

This is done also in zlib, where the bitlengths themselves are further compressed with a static Huffman tree. Unfortunately rewriting the code from zlib/inflate specification is not an easy task. — Aki Suihkonen, Feb 26 '13 at 07:41
The encoding and decoding process is straightforward. The main point is to keep the compressed file as small as possible. The compressed file has the sequence of encoded symbols, but it also needs to store the tree that was used for encoding, so that the data can be decoded later. Sometimes this will be costly in terms of space, and the point of compressing is to save space. I am looking for a clever way to dehydrate the tree in the compressed file. My assumption is that with your suggested way of storing the tree, it will take 1024 bytes in the worst case. — EasyQuestions, Feb 27 '13 at 02:57
That sounds wrong to me. The tree shouldn't take anything like that much to encode, but I don't understand why you want to encode the tree at all. It's the codes you need, not the tree. — JasonD, Feb 27 '13 at 06:31
Yes, but the codes then have to be decoded to reconstruct the original file. And this is only possible if you use the same tree that you used when encoding. — EasyQuestions, Feb 27 '13 at 10:27
No. You need the same *codes*, not the same tree. Obviously you need to use the canonical coding for both compression and decompression, but the tree is only needed to generate the initial code-lengths, which is all you need to store. — JasonD, Feb 27 '13 at 10:39
Storing a bit to say whether a node is a leaf or not, and then associating each leaf node with its symbol, sounds interesting. Can you produce pseudocode, so I can better understand the logic? — EasyQuestions, Feb 27 '13 at 10:48
I have tried to elaborate on both techniques. I would stress that the first, building the canonical code and not storing the tree, is almost certainly the better approach. — JasonD, Feb 27 '13 at 11:14

score 2 · Answer 2 · answered Feb 27 '13 at 07:49

The zlib/DEFLATE specification (RFC 1951) explains that to store a Huffman tree one only needs the bit-length of each symbol. E.g. if one constructs a tree for A=101, B=111, C=110, D=01, one simply records the bit-lengths and regenerates the codes from the lengths so that codewords of the same length are consecutive. The code below produces A=010, B=011, C=100, D=00.

Set bl_count[2]=1, bl_count[3]=3 (all other entries zero) and iterate:

code = 0;   // From z-lib specification, RFC 1951
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
    code = (code + bl_count[bits-1]) << 1;
    next_code[bits] = code;
}
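
For context, the loop above is only the middle step of the procedure in RFC 1951 (section 3.2.2); the steps before and after it count the lengths and hand out the codes. A self-contained sketch of all three steps might look like this, where build_codes and its parameter names are invented for the example and every length is assumed to be at most MAX_BITS:

```c
#define MAX_BITS 15   /* longest code length DEFLATE allows */

/* lengths[i] is the code length of symbol i (0 means the symbol is unused);
   codes[i] receives the canonical code, right-aligned.
   Assumption: every length is <= MAX_BITS. */
void build_codes(const unsigned char *lengths, unsigned *codes, int n)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned code = 0;

    /* Step 1: count how many codes there are of each length. */
    for (int i = 0; i < n; i++)
        bl_count[lengths[i]]++;
    bl_count[0] = 0;

    /* Step 2: find the numerically smallest code for each length. */
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Step 3: assign consecutive values to the symbols, in symbol order. */
    for (int i = 0; i < n; i++)
        if (lengths[i] != 0)
            codes[i] = next_code[lengths[i]]++;
}
```

With the lengths 3, 3, 3, 2 for A, B, C, D, the second step gives next_code[2] = 0 and next_code[3] = 2, so D receives 00 and A, B, C receive 010, 011, 100.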

As the maximum code length in DEFLATE is 15, one needs at most 4 bits per symbol to store these lengths: 3,3,3,2 == 0011 0011 0011 0010. However, zlib/deflate does better: it run-length encodes the lengths using escape symbols (16 = repeat the previous length 3 to 6 times, 17 = repeat a zero length 3 to 10 times, 18 = repeat a zero length 11 to 138 times) to further compress the stream of code lengths. The zero-length runs also take care of missing characters.
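
The plain 4-bits-per-symbol layout (without the run-length step) could be sketched like this. The function names are made up, and the choice of putting the first length in the high nibble is just one possible convention:

```c
#include <stddef.h>

/* Packs code lengths (each 0..15) two per byte, first length in the high
   nibble. out must hold (count + 1) / 2 bytes; an odd count is padded with 0. */
void pack_lengths(const unsigned char *lengths, size_t count, unsigned char *out)
{
    for (size_t i = 0; i < count; i += 2) {
        unsigned char hi = lengths[i] & 0x0F;
        unsigned char lo = (i + 1 < count) ? (lengths[i + 1] & 0x0F) : 0;
        out[i / 2] = (unsigned char)((hi << 4) | lo);
    }
}

/* Reverses pack_lengths(): reads count lengths back out of the nibbles. */
void unpack_lengths(const unsigned char *in, size_t count, unsigned char *lengths)
{
    for (size_t i = 0; i < count; i++) {
        unsigned char byte = in[i / 2];
        lengths[i] = (i % 2 == 0) ? (unsigned char)(byte >> 4)
                                  : (unsigned char)(byte & 0x0F);
    }
}
```

For a full 256-symbol alphabet this comes to 128 bytes before any run-length encoding is applied.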

I don't understand how this is possible, given that the content of a file (its symbols) doesn't produce a unique Huffman tree (multiple Huffman trees can be generated from a single information source); the result really depends on the implementation, for example on whether the tree was built with the 0-child as the left child or as the right child. — EasyQuestions, Feb 27 '13 at 10:42
The zlib documentation introduces an algorithm to convert any valid Huffman tree to a _canonical_ tree that has exactly the same ability to compress, even though it means that to encode "A" one has to output "100110" instead of the original code "110100". The canonical tree is reproducible from the symbol lengths. — Aki Suihkonen, Feb 27 '13 at 11:08
