
I've found a lot of questions asking this, but some of the explanations were very difficult to understand and I couldn't quite grasp how to efficiently decompress the file. I have found these related questions: "Huffman code with lookup table" and "How to decode huffman code quickly?"

But I fail to understand the explanations. I know how to encode and decode a Huffman tree the regular way. Right now my compression program can write any of the following information to file:
  • symbol
  • Huffman code (unsigned long)
  • Huffman code length

What I plan to do is take a text file, split it into smaller text files, and compress each individually; then I want to decompress by sending all the small compressed files, along with their respective lookup tables (I don't know how to do this part), to an Nvidia GPU to decompress them in parallel using some sort of lookup table.

I have 3 questions:
  1. What information should I write to the file header to construct the lookup table?
  2. How do I recreate this table from the file?
  3. How do I use it to decode the Huffman-encoded file quickly?

Eddi3
    If you're going to compress small bits individually, then make sure you generate the table for the whole file first, otherwise you'll have to have separate tables for each bit which will eat into your compression. – GazTheDestroyer Apr 27 '15 at 08:09
  • OK, so make the table for the whole file and put it into GPU memory. Now how do I create the table, and how do I use it effectively? – Eddi3 Apr 27 '15 at 08:28
  • The only problem I see with making the table for the whole file is that it would be difficult to determine where to "cut" the bits of the Huffman-encoded string, unless I made a table individually for each segment of the file. – Eddi3 Apr 27 '15 at 08:40

1 Answer


Don't bother writing it yourself, unless this is a didactic exercise. Use zlib, lz4, or any of several other free compression/decompression libraries out there that are far better tested than anything you'll be able to do.

You are only talking about Huffman coding, which indicates that you would only get a small portion of the available compression. Most of the compression in the libraries mentioned comes from matching strings. Look up "LZ77".

As for efficient Huffman decoding, you can look at how zlib's inflate does it. It creates a lookup table for the most-significant nine bits of the code. Each entry in the table has either a symbol and the number of bits for that code (less than or equal to nine), or, if those nine bits are a prefix of a longer code, a pointer to a secondary table that resolves the rest of the code, along with the number of bits needed to index that secondary table. (There are several of these secondary tables.) There are multiple entries for the same symbol if the code length is less than nine; in fact, 2^(9-n) entries for an n-bit code.

So to decode, you get nine bits from the input and look up the entry in the table. If it is a symbol, then you remove the number of bits indicated for that code from your stream and emit the symbol. If it is a pointer to a secondary table, then you remove nine bits from the stream, take the number of bits indicated for that secondary table, and look the entry up there. Now you will definitely get a symbol to emit, along with the number of remaining bits to remove from the stream.

Mark Adler
  • The poster wanted to build a huffman lookup table implementation that would run on the GPU. No libraries are available for that. I recently built a huffman decoder that runs on the GPU with Metal under iOS, C++ code implements the huffman logic and it can be reused easily. See https://github.com/mdejong/MetalHuffman – MoDJ Jun 20 '18 at 18:42
  • What is the reason for having several secondary tables? – Silicomancer Nov 01 '21 at 12:36
  • Then the tables are smaller and can be built faster than a single table. The decompression would be slower if you were building a potentially 32,768-entry table for every deflate block. – Mark Adler Nov 01 '21 at 22:28