I've been trying to implement a Huffman decoder, and my initial attempt suffered from low performance due to a sub-optimal choice of decoding algorithm.

I thought I'd try to implement Huffman decoding using table lookups. However, I got a bit stuck on generating the sub-tables and was hoping someone could point me in the right direction.

#include <array>
#include <cstdint>
#include <stack>
#include <vector>
#include <tbb/cache_aligned_allocator.h>
// bit_reader and CACHE_ALIGN are small helpers of mine, not shown here.

struct node
{
    node*               children; // 0 right, 1 left
    uint8_t             value;
    uint8_t             is_leaf;
};

struct entry
{
    uint8_t              next_table_index;
    std::vector<uint8_t> values;

    entry() : next_table_index(0){}
};

void build_tables(node* nodes, std::vector<std::array<entry, 256>>& tables, int table_index);
void unpack_tree(void* data, node* nodes);

std::vector<uint8_t, tbb::cache_aligned_allocator<uint8_t>> decode_huff(void* input)
{
    // Initial setup
    CACHE_ALIGN node                    nodes[512] = {};

    auto data = reinterpret_cast<uint32_t*>(input);
    size_t table_size  = *(data++); // Table size is the first 32 bits.
    size_t result_size = *(data++); // Decoded size is the second 32 bits.

    unpack_tree(data, nodes);

    auto huffman_data  = reinterpret_cast<uint32_t*>(input) + (table_size+32)/32;
    size_t data_size   = *(huffman_data++); // Encoded data size, first 32 bits of the data block (unused here).
    auto huffman_data2 = reinterpret_cast<char*>(huffman_data);

    // Build tables

    std::vector<std::array<entry, 256>> tables(1);
    build_tables(nodes, tables, 0);

    // Decode

    uint8_t current_table_index = 0;

    std::vector<uint8_t, tbb::cache_aligned_allocator<uint8_t>> result; 
    while(result.size() < result_size)
    {
        auto& table  = tables[current_table_index];

        uint8_t key = *(huffman_data2++);
        auto& values = table[key].values;
        result.insert(result.end(), values.begin(), values.end());

        current_table_index = table[key].next_table_index;
    }

    result.resize(result_size);

    return result;
}

void build_tables(node* nodes, std::vector<std::array<entry, 256>>& tables, int table_index)
{
    // For every possible input byte, walk 8 levels down from this table's
    // starting node and collect any symbols emitted along the way.
    for(int n = 0; n < 256; ++n)
    {
        auto current = nodes;

        for(int i = 0; i < 8; ++i)
        {
            current = current->children + ((n >> i) & 1);
            if(current->is_leaf)
                tables[table_index][n].values.push_back(current->value);
        }

        // If the walk ends on an internal node, we need a sub-table that
        // continues from that node. For internal nodes, "value" doubles as
        // the table index; 0 means the sub-table hasn't been generated yet.
        if(!current->is_leaf)
        {
            if(current->value == 0)
            {
                current->value = tables.size();
                tables.push_back(std::array<entry, 256>());
                build_tables(current, tables, current->value);
            }

            tables[table_index][n].next_table_index = current->value;
        }
    }
}

void unpack_tree(void* data, node* nodes)
{   
    node* nodes_end = nodes+1;      
    bit_reader table_reader(data);  
    unsigned char n_bits = ((table_reader.next_bit() << 2) | (table_reader.next_bit() << 1) | (table_reader.next_bit() << 0)) & 0x7; // First 3 bits are n_bits-1.

    // Unpack huffman-tree
    std::stack<node*> stack;
    stack.push(&nodes[0]);      // "nodes" is root
    while(!stack.empty())
    {
        node* ptr = stack.top();
        stack.pop();
        if(table_reader.next_bit())
        {
            ptr->is_leaf = 1;
            // Leaves borrow the root's children so that a table walk can step
            // "through" a leaf and keep decoding from the root.
            ptr->children = nodes[0].children;
            for(int n = n_bits; n >= 0; --n)
                ptr->value |= table_reader.next_bit() << n;
        }
        else
        {
            ptr->children = nodes_end;
            nodes_end += 2;

            stack.push(ptr->children+0);
            stack.push(ptr->children+1);
        }
    }   
}

1 Answer

First off, avoid all those vectors. You can have pointers into a single preallocated buffer, but you don't want the scenario where vector allocates these tiny, tiny buffers all over memory, and your cache footprint goes through the roof.
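
For instance, something along these lines (flat_entry, decode_tables and the field names here are just illustrative, not a fixed layout): one shared pool of output bytes, with each entry holding only an offset and a count into it.

#include <cstdint>
#include <vector>

// All output bytes for every table entry live in one contiguous pool;
// an entry only records where its slice begins and how many bytes it has.
struct flat_entry
{
    uint32_t value_offset; // first output byte in the shared pool
    uint8_t  value_count;  // at most 8 symbols can be emitted per input byte
    uint8_t  next_state;   // internal-node state to continue from
};

struct decode_tables
{
    std::vector<uint8_t>    output_pool; // grows once, stays contiguous
    std::vector<flat_entry> entries;     // one entry per (internal state, input byte) pair
};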

Note also that the number of non-leaf states might be much less than 256. Indeed, it might be as low as 128. By assigning them low state IDs, we can avoid generating table entries for the entire set of state nodes (which may be as high as 511 nodes in total). After all, after consuming input, we'll never end up on a leaf node; if we do, we generate output, then head back to the root.

The first thing we should do, then, is reassign those states that correspond to internal nodes (ie, ones with pointers out to non-leaves) to low state numbers. You can use this to also reduce memory consumption for your state transition table.
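
A rough sketch of that renumbering, reusing the node struct from the question (storing the assignment in an unordered_map is just one option; you could equally write the ID into a spare field of the node):

#include <stack>
#include <unordered_map>

// Give every internal node a small consecutive state ID. Leaves get none:
// a byte-sized step always finishes on an internal node, emitting output and
// restarting at the root whenever it passes through a leaf.
std::unordered_map<const node*, int> number_internal_nodes(node* root)
{
    std::unordered_map<const node*, int> state_ids;
    std::stack<node*> pending;
    pending.push(root);
    while(!pending.empty())
    {
        node* current = pending.top();
        pending.pop();
        if(current->is_leaf)
            continue;
        int id = static_cast<int>(state_ids.size());
        state_ids[current] = id;
        pending.push(current->children + 0);
        pending.push(current->children + 1);
    }
    return state_ids;
}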

Once we've assigned these low state numbers, we can go through each possible non-leaf state, and each possible input byte (ie, a doubly-nested for loop). Traverse the tree as you would for a bit-based decoding algorithm, and record the set of output bytes, the final node ID you end up on (which must not be a leaf!), and whether you hit an end-of-stream mark.
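
Putting that together with the earlier sketches (build_entry, build_all, flat_entry, decode_tables and number_internal_nodes are all illustrative names, the bit order follows the question's LSB-first walk, and end-of-stream handling is left out), the double loop might look roughly like this:

// One entry per (internal state, input byte) pair: walk 8 bits down the
// tree, append every symbol emitted along the way to the shared pool, and
// remember which internal node the walk finishes on.
void build_entry(const node* root, const node* state, uint8_t byte,
                 const std::unordered_map<const node*, int>& state_ids,
                 decode_tables& tables)
{
    flat_entry e;
    e.value_offset = static_cast<uint32_t>(tables.output_pool.size());
    e.value_count  = 0;

    const node* current = state;
    for(int bit = 0; bit < 8; ++bit)
    {
        current = current->children + ((byte >> bit) & 1);
        if(current->is_leaf)
        {
            tables.output_pool.push_back(current->value);
            ++e.value_count;
            current = root; // symbol emitted, continue decoding from the root
        }
    }

    e.next_state = static_cast<uint8_t>(state_ids.at(current)); // never a leaf here
    tables.entries.push_back(e);
}

void build_all(const node* root,
               const std::unordered_map<const node*, int>& state_ids,
               decode_tables& tables)
{
    // Order states by their ID so a lookup is entries[state * 256 + byte].
    std::vector<const node*> states(state_ids.size());
    for(const auto& kv : state_ids)
        states[kv.second] = kv.first;

    tables.entries.reserve(states.size() * 256); // size once, never copy around

    for(const node* state : states)
        for(int byte = 0; byte < 256; ++byte)
            build_entry(root, state, static_cast<uint8_t>(byte), state_ids, tables);
}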

bdonlan
  • I've updated my post with what I understood from your answer. I'm still unsure as to how to generate the sub-tables. – ronag Aug 29 '11 at 19:37
  • @ronag, step back and read my post from the top - I don't see any code to renumber inner nodes, for a start. If you don't renumber inner nodes, you must loop through all 511 potential states (and within that loop, all 256 possible input bytes) – bdonlan Aug 29 '11 at 19:42
  • Isn't that what I do with "current->value = tables.size();", I just allocate actually needed sub-tables? – ronag Aug 29 '11 at 20:00
  • Ok, thanks to your help I have got it to produce correct results (I've updated my code), however it is exceptionally slow and can't decode data sets larger than 128; it seems to have O(n^2) performance. I seem to get a large number of sub-tables. – ronag Aug 29 '11 at 20:21
  • You're not generating these tables once for each byte of input, are you? :) Also, size your vector once, at the start - don't keep copying it around. Generating 256 * 512 entries shouldn't be too bad. – bdonlan Aug 29 '11 at 20:22
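
A minimal illustration of that last point about sizing the vector once, using the question's own entry and build_tables (the 256 bound is only an assumption based on next_table_index being a uint8_t):

// Reserve the worst-case number of sub-tables up front so that growing the
// vector during the recursive build never reallocates and copies the
// 256-entry arrays already generated.
std::vector<std::array<entry, 256>> tables;
tables.reserve(256);   // next_table_index is a uint8_t, so at most 256 tables are addressable
tables.emplace_back(); // table 0 is the root table
build_tables(nodes, tables, 0);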