My question is specific. The theory of Huffman coding is easy enough to understand, but it produces codes that usually do not align to byte boundaries, and the practical methods for dealing with that are not covered in the tutorials I have come across so far.
There are two problems:
(1) Once a file is encoded, the end of the resulting Huffman-coded data may not fall on a byte boundary. How does the decoder know that it has reached the end of the Huffman-coded data in a compressed file? (See the bit-writer sketch after this list.)
(2) Provided a Huffman table is included in the file to aid decompression, how is such a table created in practice, since we again run into non-alignment with byte boundaries? The symbols themselves may be 8 or 16 bits, but a Huffman code can be any number of bits. If we include one Huffman code per symbol, we also have to store how many bits long each code is, so that the decoder can build a binary tree or some other data structure for decompression. (See the table-serialization sketch below.)
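To make problem (1) concrete, here is a minimal bit-writer sketch (my own illustrative names, not from any standard): the encoder has to pad the final partial byte, so the decoder needs out-of-band information to know where the real data stops, typically an original symbol/byte count stored in a header, or a dedicated EOF symbol given its own Huffman code.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit writer: accumulates bits MSB-first and flushes
 * whole bytes to a file. Initialize acc = 0, nbits = 0 before use. */
typedef struct {
    FILE    *fp;
    uint8_t  acc;    /* partially filled output byte    */
    int      nbits;  /* number of bits currently in acc */
} BitWriter;

void bw_put(BitWriter *bw, uint32_t code, int len)
{
    /* emit the code MSB-first, one bit at a time (simple, not fast) */
    for (int i = len - 1; i >= 0; i--) {
        bw->acc = (uint8_t)((bw->acc << 1) | ((code >> i) & 1u));
        if (++bw->nbits == 8) {
            fputc(bw->acc, bw->fp);
            bw->acc = 0;
            bw->nbits = 0;
        }
    }
}

void bw_flush(BitWriter *bw)
{
    /* pad the final partial byte with zero bits; these padding bits
     * are exactly why the decoder needs a stored symbol count or an
     * EOF symbol to know when to stop decoding */
    if (bw->nbits > 0) {
        bw->acc = (uint8_t)(bw->acc << (8 - bw->nbits));
        fputc(bw->acc, bw->fp);
        bw->nbits = 0;
    }
}
```

For problem (2), one simple scheme (again a sketch under my own naming, not a standard layout) stores one (symbol, bit length, code) triple per symbol, using fixed-width fields so the table itself stays byte-aligned. A canonical Huffman code avoids storing the codes entirely: only the lengths are written, and the decoder regenerates the codes deterministically from them.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical table entry: one per symbol present in the tree. */
typedef struct {
    uint8_t  symbol;  /* the 8-bit source symbol             */
    uint8_t  length;  /* Huffman code length in bits (1..16) */
    uint16_t code;    /* the code itself, right-aligned      */
} HuffEntry;

/* Write a fixed-width table: a 16-bit entry count, then
 * (symbol, length, code) triples. Wasteful but unambiguous;
 * the decoder reads the same fixed-width fields back and
 * rebuilds its tree or lookup table from them. */
void write_table(FILE *fp, const HuffEntry *tab, uint16_t count)
{
    fputc(count & 0xFF, fp);
    fputc(count >> 8, fp);
    for (uint16_t i = 0; i < count; i++) {
        fputc(tab[i].symbol, fp);
        fputc(tab[i].length, fp);
        fputc(tab[i].code & 0xFF, fp);
        fputc(tab[i].code >> 8, fp);
    }
}
```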
Huffman and arithmetic coding seem to be used in a lot of compression systems, so this question keeps popping up.
I am trying to understand how this is done in JPEG, and will be building an encoder in C on a Nios II soft-core processor in an FPGA to save JPEG files from a camera to an SD card.
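For JPEG specifically, both problems are solved by the format itself. The entropy-coded scan is delimited by markers: padding bits in the final byte are set to 1, and any 0xFF byte in the compressed data is followed by a stuffed 0x00 so it cannot be mistaken for a marker, which means the decoder simply reads until it hits the next real marker (e.g. EOI, 0xFFD9). The DHT marker segment stores each table as 16 byte-counts (how many codes there are of each length 1..16) followed by the symbols in code order; the codes themselves are never stored because they are canonical. The sketch below regenerates them, following the procedure in the JPEG spec (ITU-T T.81, Annex C):

```c
#include <stdint.h>

/* Rebuild canonical Huffman codes from a JPEG DHT segment:
 * bits[i] = number of codes of length i+1 (i = 0..15); the DHT
 * segment then lists the symbols in code order, so the decoder
 * pairs codes[k]/lengths[k] with the k-th symbol. The caller must
 * size codes[] and lengths[] to the sum of all bits[] counts.
 * This mirrors Generate_size_table / Generate_code_table in
 * ITU-T T.81, Annex C. */
void jpeg_make_codes(const uint8_t bits[16],
                     uint16_t codes[], uint8_t lengths[])
{
    uint16_t code = 0;
    int k = 0;
    for (int len = 1; len <= 16; len++) {
        for (int i = 0; i < bits[len - 1]; i++) {
            codes[k]   = code++;
            lengths[k] = (uint8_t)len;
            k++;
        }
        code <<= 1;  /* moving to the next length appends a zero bit */
    }
}
```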