Decoding Huffman file from canonical form

Question

I am writing a Huffman-compressed file whose header stores the code lengths of the canonical codes. During decoding, I can regenerate the canonical codes and store them in a std::map<std::uint8_t, std::vector<bool>>. The actual data is read into a single std::vector<bool>. Before anyone suggests std::bitset, let me clarify that Huffman codes have variable bit lengths, which is why I am using std::vector<bool>. So, given that I have my symbols and their corresponding canonical codes, how do I decode my file? I don't know where to go from here. Can someone explain how to decode this file? I couldn't find anything useful on the subject when searching.

"Generally speaking, the process of decompression is simply a matter of translating the stream of prefix codes to individual byte values, usually by traversing the Huffman tree node by node as each bit is read from the input stream" (from [Huffman coding](http://en.wikipedia.org/wiki/Huffman_coding#Decompression)). – royhowie Apr 11 '15 at 07:41

Mark Adler · Answer 1 · 2015-04-11T20:26:35.647

11

You do not need to create the codes or the tree in order to decode canonical codes. All you need is the list of symbols in order and the count of symbols in each code length. By "in order", I mean sorted by code length from shortest to longest, and within each code length, sorted by the symbol value.
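Since the question stores one code length per symbol in the header, the two tables described above can be built directly from those lengths. The following is only a sketch under assumptions that are not in the answer: the names `CanonicalTables`, `build_tables`, and `kMaxBits` are made up for illustration, symbols are byte values, and a length of 0 marks an unused symbol.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kMaxBits = 15;  // assumed upper bound on code length

// Hypothetical container for the two tables a canonical decoder needs.
struct CanonicalTables {
    std::array<int, kMaxBits + 1> count{};  // count[len] = symbols coded with len bits
    std::vector<std::uint8_t> symbol;       // symbols sorted by (length, value)
};

// lengths[s] is the code length of byte value s; 0 means "symbol not used".
CanonicalTables build_tables(const std::array<std::uint8_t, 256>& lengths) {
    CanonicalTables t;
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        assert(lengths[s] <= kMaxBits);
        ++t.count[lengths[s]];
    }
    t.count[0] = 0;  // unused symbols are not part of the code

    // offs[len] = position in symbol[] of the first symbol of length len.
    std::array<int, kMaxBits + 2> offs{};
    for (int len = 1; len <= kMaxBits; ++len)
        offs[len + 1] = offs[len] + t.count[len];

    // Visiting symbols in increasing value keeps each length group sorted.
    t.symbol.resize(offs[kMaxBits + 1]);
    for (std::size_t s = 0; s < lengths.size(); ++s)
        if (lengths[s] != 0)
            t.symbol[offs[lengths[s]]++] = static_cast<std::uint8_t>(s);
    return t;
}
```

With this, the std::map of codes is not needed for decoding at all; the header lengths alone give `count` and `symbol`.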

Since the canonical codes within a code length are sequential binary integers, you can simply do integer comparisons to see if the bits you have fall within that code range, and if they do, an integer subtraction to determine which symbol it is.

Below is code from puff.c (with minor changes) to show explicitly how this is done. bits(s, 1) returns the next bit from the stream. (This assumes that there is always a next bit.) h->count[len] is the number of symbols that are coded by length len codes, where len is in 0..MAXBITS. If you add up h->count[1], h->count[2], ..., h->count[MAXBITS], that is the total number of symbols coded, and is the length of the h->symbol[] array. The first h->count[1] symbols in h->symbol[] have length 1. The next h->count[2] symbols in h->symbol[] have length 2. And so on.

The values in the h->count[] array, if correct, are constrained so that they do not oversubscribe the number of codes available in len bits. They can be further constrained to represent a complete code, i.e. one in which no bit sequence remains undefined, in which case decode() cannot return an error (-1). For a code to be complete and not oversubscribed, the sum of h->count[len] << (MAXBITS - len) over all len from 1 to MAXBITS must equal 1 << MAXBITS.
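The sum described in this paragraph is easy to check before decoding. The sketch below is not from puff.c; `classify`, `CodeStatus`, and `kMaxBits` are hypothetical names, and the counts are assumed to be small enough that the 64-bit sum cannot overflow.

```cpp
#include <array>
#include <cstdint>

constexpr int kMaxBits = 15;  // assumed upper bound on code length

enum class CodeStatus { Complete, Incomplete, Oversubscribed };

// Each code of length len uses up 2^(kMaxBits - len) of the 2^kMaxBits
// possible kMaxBits-bit patterns; compare the total used with what exists.
CodeStatus classify(const std::array<int, kMaxBits + 1>& count) {
    std::uint64_t used = 0;
    for (int len = 1; len <= kMaxBits; ++len)
        used += static_cast<std::uint64_t>(count[len]) << (kMaxBits - len);
    const std::uint64_t total = std::uint64_t{1} << kMaxBits;
    if (used == total) return CodeStatus::Complete;
    return used < total ? CodeStatus::Incomplete : CodeStatus::Oversubscribed;
}
```

A decoder could reject the header when the result is `Oversubscribed`, and decide for itself whether `Incomplete` is acceptable.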

Simple example: if we are coding e with one bit, t with two bits, and a and o with three bits, then h->count[] is {0, 1, 1, 2} (the first value, h->count[0], is not used), and h->symbol[] is {'e','t','a','o'}. Then the code for e is 0, the code for t is 10, a is 110, and o is 111.
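To see where the codes in this example come from, here is a small sketch (not part of puff.c) that assigns canonical codes from the counts alone. `canonical_codes` and `kMaxBits` are hypothetical names, and the codes are returned as strings of '0' and '1' purely for readability.

```cpp
#include <array>
#include <string>
#include <vector>

constexpr int kMaxBits = 15;  // assumed upper bound on code length

// Returns the code of each entry of the ordered symbol list, in the same
// order: codes of one length are consecutive integers, and moving to the
// next length appends a zero bit to the next unused code.
std::vector<std::string> canonical_codes(const std::array<int, kMaxBits + 1>& count) {
    std::vector<std::string> codes;
    unsigned code = 0;
    for (int len = 1; len <= kMaxBits; ++len) {
        for (int i = 0; i < count[len]; ++i, ++code) {
            std::string bits(len, '0');
            for (int b = 0; b < len; ++b)
                if ((code >> (len - 1 - b)) & 1u) bits[b] = '1';  // MSB first
            codes.push_back(bits);
        }
        code <<= 1;
    }
    return codes;
}
```

For counts of {0, 1, 1, 2} this yields 0, 10, 110, 111, matching e, t, a, o above.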

#define MAXBITS 15              /* maximum bits in a code */

struct huffman {
    short *count;       /* number of symbols of each length */
    short *symbol;      /* canonically ordered symbols */
};

int decode(struct state *s, const struct huffman *h)
{
    int len;            /* current number of bits in code */
    int code;           /* len bits being decoded */
    int first;          /* first code of length len */
    int count;          /* number of codes of length len */
    int index;          /* index of first code of length len in symbol table */

    code = first = index = 0;
    for (len = 1; len <= MAXBITS; len++) {
        code |= bits(s, 1);             /* get next bit */
        count = h->count[len];
        if (code - count < first)       /* if length len, return symbol */
            return h->symbol[index + (code - first)];
        index += count;                 /* else update for next length */
        first += count;
        first <<= 1;
        code <<= 1;
    }
    return -1;                          /* ran out of codes */
}
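For the setup in the question, where the data sits in a std::vector<bool>, the same loop can be adapted roughly as follows. This is a sketch rather than puff.c itself: `decode_symbol`, `decode_all`, and `kMaxBits` are hypothetical names, the bit position replaces `struct state`, an end-of-input check is added, and the number of symbols to decode (`n_symbols`) is assumed to be known, for example from the file header.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kMaxBits = 15;  // same role as MAXBITS above

// Decode one symbol starting at bits[pos] and advance pos past its code.
// Returns -1 if the input ends mid-code or no code matches.
int decode_symbol(const std::vector<bool>& bits, std::size_t& pos,
                  const std::array<int, kMaxBits + 1>& count,
                  const std::vector<std::uint8_t>& symbol) {
    int code = 0, first = 0, index = 0;
    for (int len = 1; len <= kMaxBits; ++len) {
        if (pos >= bits.size()) return -1;   // ran out of input
        code |= bits[pos++] ? 1 : 0;         // append the next bit
        const int n = count[len];
        if (code - n < first)                // code is one of the n codes of this length
            return symbol[index + (code - first)];
        index += n;                          // skip the symbols of this length
        first += n;
        first <<= 1;
        code <<= 1;
    }
    return -1;                               // ran out of codes
}

// Decode up to n_symbols symbols; stops early on error or end of input.
std::vector<std::uint8_t> decode_all(const std::vector<bool>& bits,
                                     const std::array<int, kMaxBits + 1>& count,
                                     const std::vector<std::uint8_t>& symbol,
                                     std::size_t n_symbols) {
    std::vector<std::uint8_t> out;
    std::size_t pos = 0;
    while (out.size() < n_symbols) {
        const int s = decode_symbol(bits, pos, count, symbol);
        if (s < 0) break;
        out.push_back(static_cast<std::uint8_t>(s));
    }
    return out;
}
```

With the example code (counts {0, 1, 1, 2}, symbols e, t, a, o), the bits 10111100 split as 10, 111, 10, 0. Passing an explicit symbol count matters when the last byte of the file is padded, since padding bits could otherwise decode as spurious symbols.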

edited Apr 11 '15 at 20:26

answered Apr 11 '15 at 14:49

Mark Adler


Hi, Mark... I really appreciate you taking time to write that out. But, I'm really sorry to say I'm not able to understand most of it. It's not your explanation, but maybe it's my level of understanding that is making it difficult for me. Anyways... Thanks for the explanation, Mark. +1 – WDRKKS Apr 11 '15 at 19:01
Just step through the routine with an example and you'll get it. It doesn't get much simpler than this. Try decoding these bits, from left to right: `10111100` for the example code provided. – Mark Adler Apr 11 '15 at 20:29
OK... Thank you, I'll do that. A couple of things, though. You omitted `struct state` from your code sample. Should I declare that too or not? What exactly does that do? And also, how would I decode the given bits using this function? The function takes an argument of `struct huffman` and I'm reading the bits from the file into a `std::vector` for reasons stated above. Sorry if my question seems silly or foolish, but I actually didn't understand what the function does just by reading through it. – WDRKKS Apr 11 '15 at 21:15
For this routine, the state simply provides how and from where to get the bit. You can replace `bits(s, 1)` with whatever you like to return the next bit from the stream, an integer equal to `0` or `1`. For the example bits I provided in the comment above, the first call of `bits(s, 1)` would return a `1`. The second call would return a `0`. The third call, a `1`. And so on. – Mark Adler Apr 11 '15 at 23:18
OK, Mark... I'll try that out. Thank you very much for your explanation. If I hit any roadblocks, I'll ask here again. I just want to know one thing. Currently, I am writing the bit lengths in the file header. Should I make any changes to that with respect to this code? – WDRKKS Apr 12 '15 at 00:19
Sure. As long as you have transmitted the necessary information to decode, then whatever you're doing is fine. There are more compact ways to transmit the code. Once you get what you're doing working, then you can look at how deflate compresses the code description. deflate run-length encodes the code lengths, and then Huffman codes the code lengths and runs. – Mark Adler Apr 12 '15 at 01:01
@MarkAdler So I tried your code and it worked on the first try: it correctly decompresses the compressed data. Unfortunately, I do not understand how it works. The Huffman decompressor I wrote myself was much, much slower, but I understood it. It basically tried to decode all symbols with all lengths (in increasing order) until it succeeded. I do not understand how you could optimize it like that so there's only one loop. – Bregalad Aug 18 '17 at 19:28
@Bregalad I guess you just need to study the code. It depends on the code being canonical, in particular that the codes be in integer order from shortest to longest. – Mark Adler Aug 19 '17 at 15:02
Rather than consuming one bit at a time, isn't it faster to consume `MAXBITS` bits from the stream, find the first symbol length that is smaller, then subtract and index? One might even round the bit-block read to the nearest byte boundary, though that might require zero-padding the symbols. Any guidance on optimizing this? – TemplateRex Aug 21 '18 at 06:31
@TemplateRex The bits are backwards for that, so you would need to reverse them, which the bit-at-a-time code is doing. The reverse could be done with a table, but once you go there, you can also decode using the table. You would then end up with something like zlib's inflate. – Mark Adler Aug 21 '18 at 17:07

muued · Answer 2 · 2015-04-11T10:17:02.400

Your map contains the relevant information, but it maps symbols to codes, whereas the data you are trying to decode consists of codes. Thus your map can't be used to look up the symbols for the codes you read in an efficient way, since the lookup method expects a symbol. Searching for a code and retrieving the corresponding symbol would be a linear search.

Instead you should reconstruct the Huffman tree you constructed for the compression step. The frequency values of the inner nodes are irrelevant here, but you will need the leaf nodes at the correct positions. You can create the tree on the fly as you read your file header. Create an empty tree initially. For each symbol to code mapping you read, create the corresponding nodes in the tree. E.g. if the symbol 'D' has been mapped to the code 101, then make sure there is a right child node at the root, which has a left child node, which has a right child node, which contains the symbol 'D', creating the nodes if they were missing.

Using that tree you can then decode the stream as follows (pseudo-code, assuming taking a right child corresponds to adding a 1 to the code):

// use a node variable to remember the position in the tree while reading bits
node n = tree.root
while(stream not fully read) {
    read next bit into boolean b
    if (b == true) {
        n = n.rightChild
    } else {
        n = n.leftChild
    }
    // check whether we are in a leaf node now
    if (n.leftChild == null && n.rightChild == null) {
        // n is a leaf node, thus we have read a complete code
        // add the corresponding symbol to the decoded output
        decoded.add(n.getSymbol())
        // reset the search
        n = tree.root
    }
}
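In C++, the tree construction and the traversal above might look like the sketch below. The names `Node`, `build_tree`, and `decode_with_tree` are hypothetical, and the map type is the one from the question. Two assumptions go beyond the pseudo-code: decoding stops if a bit leads to a missing child, and every bit in the input belongs to a code (no padding).

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Node {
    std::unique_ptr<Node> left, right;  // left = 0 bit, right = 1 bit
    std::uint8_t symbol = 0;            // only meaningful in a leaf
    bool is_leaf() const { return !left && !right; }
};

// Create one root-to-leaf path per code, adding nodes where they are missing.
std::unique_ptr<Node> build_tree(const std::map<std::uint8_t, std::vector<bool>>& codes) {
    auto root = std::make_unique<Node>();
    for (const auto& entry : codes) {
        Node* n = root.get();
        for (bool bit : entry.second) {
            auto& child = bit ? n->right : n->left;
            if (!child) child = std::make_unique<Node>();
            n = child.get();
        }
        n->symbol = entry.first;
    }
    return root;
}

// Walk the tree bit by bit; emit a symbol and restart at each leaf.
std::vector<std::uint8_t> decode_with_tree(const Node& root, const std::vector<bool>& bits) {
    std::vector<std::uint8_t> out;
    const Node* n = &root;
    for (bool bit : bits) {
        n = bit ? n->right.get() : n->left.get();
        if (!n) break;                  // bit sequence is not a valid code
        if (n->is_leaf()) {
            out.push_back(n->symbol);
            n = &root;
        }
    }
    return out;
}
```

If the file pads the last byte, the padding bits would need to be excluded before calling `decode_with_tree`, for instance by storing the number of symbols or of valid bits in the header.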

Note that inverting your map to get a lookup in the correct direction (from codes to symbols) will still result in suboptimal performance compared to binary tree traversal, since it can't narrow the search space with each bit read the way the traversal does.

Thanks for replying, muued. I do have some doubts, though. The `decoded.add(n.getSymbol())`... That's where I match the generated code to the existing one in the map and add it to the file, right? I am using `std::back_inserter` for adding. Also, how would one check for a leaf node? `node == NULL`, right? Finally, you mentioned inverting the map... where would I be inverting the map? Please tell me this. Thank you. – WDRKKS Apr 11 '15 at 09:00
I tried to explain it further via an edit. `decoded.add(n.getSymbol())` is where you have successfully matched the code read to a symbol and append it to the output file. The leaf check is performed explicitly. You could invert the map to get a mapping from codes to symbols. This direction is the one you want, since you read codes and want the corresponding symbols. However, the tree traversal is the better way to do it. – muued Apr 11 '15 at 09:29
OK... Now I have a clear picture of this. I will try to implement this and see how it goes. One last thing, however. Since I'm building my tree in the compression step using frequencies, I'm not writing frequencies to the file, and you mentioned that frequency values are irrelevant, how would I go about recreating the tree using just my map? I tried to search for this but couldn't find anything. Please just tell me this. I really appreciate your help. Thank you. – WDRKKS Apr 11 '15 at 09:57
You can use your map or (even simpler) just use the information from the file as you read it, which saves you from creating the map altogether. I explained how to create the tree in another edit. As you can see, you just create the nodes and put symbols into the leaves. No frequencies are needed. – muued Apr 11 '15 at 10:19
OK... I think I can work with that. I'll try to implement this and see what happens. If I run into any problems, I'll post back here again. Thank you very much for your help, muued. – WDRKKS Apr 11 '15 at 11:04
