
I have code that implements Huffman coding to encode text.

Given the following text

abbccc

My program generates the following table

a -> 00
b -> 01
c -> 1

So the encoded text (bit array) is

000101111

The problem is: I need to encode the table together with the text, and I do not know what the recommended approach is.

What I have thought so far:

  • First byte is the number N of key-value pairs in table
  • Following N*2 bytes are the key-value pairs themselves (one byte for key, one byte for value)
  • Remaining bits are the encoded text itself

Could you suggest a more flexible yet inexpensive (low memory usage) approach?

Gabriel
  • I'm pretty sure the DEFLATE algorithm uses Huffman coding; you can take a peek at how they do it. Warning: the document is a bit dense and difficult to understand, and I have yet to understand it myself. From what I recall from several years back, if you follow a certain procedure to build the Huffman tree and then store the key-value pairs in some sorted order, there is a unique representation, which makes storing the table very compact. https://www.ietf.org/rfc/rfc1951.txt – ithenoob Jun 03 '17 at 02:57
  • If you are out to store the table in the minimum possible size (you don't really say what "inexpensive" means), then look up canonical Huffman codes. For starters, https://en.wikipedia.org/wiki/Canonical_Huffman_code . I _think_ this is what @ithenoob is discussing? – Gene Jun 03 '17 at 03:50
  • @Gene: I was specifically referring to the way DEFLATE manages its dictionary, which I believe is also compressed in some arcane way. However, the Wikipedia page you linked to seems more promising, as it is more readable and explicitly states that compact Huffman representations will be discussed. – ithenoob Jun 03 '17 at 03:58
  • @Gene, "inexpensive" refers to memory. I edited my question to make it clear. Thanks for pointing this out. – Gabriel Jun 03 '17 at 05:12
  • @ithenoob, I'm gonna read this spec ASAP, Thank you. – Gabriel Jun 03 '17 at 05:12
  • "one byte for key, one byte for value" does not work because the value is variable-length and can easily be multiple bytes. The DEFLATE suggestion is obviously good... I believe the Wikipedia page is describing the same algorithm. – Nemo Jun 03 '17 at 05:26
  • See also https://stackoverflow.com/a/34569068/768469 (written by Mark Adler, the zlib guy) – Nemo Jun 03 '17 at 05:31

1 Answer


Deflate (RFC 1951) stores the Huffman table in front of the compressed data; see Section 3.2.7. You don't need to store the codes themselves (in your case 00, 01, 1), only their lengths (that is, 2, 2, 1). Section 3.2.2 describes how to convert those lengths back to codes when decompressing. The rebuilt codes are canonical, so they can differ from the ones your tree produced; the encoder has to assign its codes with the same procedure.

The table is a sequence of code lengths, one for every symbol in the alphabet; in your small example it would look something like 0,0,0,...,2,2,1,...,0. A zero means the symbol does not appear in the file; only a, b, c have nonzero lengths (2, 2, 1). To make this table of lengths compact you can run-length encode it. In Deflate (Section 3.2.7), length symbol 18 followed by a repeat count encodes a run of 11 to 138 zero lengths, and symbol 17 covers shorter runs of 3 to 10.

The next question is how to encode the length symbols themselves, such as 18. To keep it simple, you can use fixed 5-bit codes to represent length symbols 0 to 18. Or you can Huffman encode them as well, which is what Deflate does, using codes of up to 7 bits.
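
As a minimal sketch of the Section 3.2.2 procedure, here is how the codes could be rebuilt from the lengths alone. Python is chosen only for illustration, and the names `canonical_codes` and `encode` are hypothetical (they come from neither the RFC nor the question). It assumes symbols are ordered by their natural sort order and that codes are handled as bit strings rather than packed bits.

```python
def canonical_codes(lengths):
    """Map each symbol to its canonical Huffman code, as a bit string.

    `lengths` maps symbol -> code length in bits; a length of 0 means the
    symbol is unused. Mirrors the three steps of RFC 1951, Section 3.2.2.
    """
    max_len = max(lengths.values(), default=0)

    # Step 1: count how many codes exist for each code length.
    bl_count = [0] * (max_len + 1)
    for n in lengths.values():
        if n > 0:
            bl_count[n] += 1

    # Step 2: compute the numerically smallest code for each length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code

    # Step 3: hand out consecutive codes to symbols in sorted order.
    codes = {}
    for sym in sorted(lengths):
        n = lengths[sym]
        if n > 0:
            codes[sym] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes


def encode(text, codes):
    """Concatenate the code of every symbol in `text` (bit string output)."""
    return "".join(codes[ch] for ch in text)


if __name__ == "__main__":
    # Only the lengths from the question's table are needed: a=2, b=2, c=1.
    table = canonical_codes({"a": 2, "b": 2, "c": 1})
    print(table)
    print(encode("abbccc", table))
```

For the lengths in the question this gives a -> 10, b -> 11, c -> 0, so abbccc becomes 101111000. That is still 9 bits, but not the same bit pattern as with the original table, so both the encoder and the decoder need to derive their codes this way.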

B Abali