
All the examples of Huffman encoding I've seen use letters (A, B, C) as the symbols being encoded: they calculate the frequency of each letter and use those frequencies to build the Huffman tree. What happens when the data you want to encode is binary? I've seen people treat each byte as a symbol, but why? It seems arbitrary to use 8 bits as the cutoff for a "character". Why not 16? Why not 32 on a 32-bit architecture?

Steven Yang
  • Probably because those are simplified examples (and everyone knows the ASCII table, somewhat). I would assume (not sure) that in DEFLATE, for example, the entropy coding is maybe not byte-wise. From a performance perspective, byte alignment (not necessarily one byte) probably makes sense. 32-bit inputs would also generate bigger Huffman trees, which must be stored and transmitted, so there is also a trade-off in terms of data size. – sascha Apr 20 '19 at 17:47

1 Answer


It is perceptive of you to realize that Huffman encoding can work with more than 256 symbols. A few implementations of Huffman coding work with far more than 256 symbols (a small sketch of building a Huffman code over an arbitrary symbol alphabet follows this list), such as:

  • HuffWord, which parses English text into more-or-less English words (typically blocks of text with around 32,000 unique words) and generates a Huffman tree where each leaf represents an English word, encoded with a unique Huffman code
  • HuffSyllable, which parses text into syllables, and generates a Huffman tree where each leaf represents (approximately) an English syllable, encoded with a unique Huffman code
  • DEFLATE, which first replaces repeated strings with (length, offset) references and then uses several different Huffman tables: one optimized for representing distances (offsets), and another with 286 symbols where each leaf represents either a literal byte, the end-of-block marker, or a length (the length part of a (length, offset) reference).
  • Some of the length-limited Huffman trees used in JPEG compression, which encode quantized brightness values (from -2047 to +2047?) with code lengths capped at 16 bits.
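
Here is a minimal sketch of the general idea in plain Python (not taken from any particular library; the function name huffman_code_lengths and the word-splitting example are mine, loosely mimicking the HuffWord approach). The point is that the Huffman construction never cares what a "symbol" is; only the frequency table does.

    import heapq
    from collections import Counter

    def huffman_code_lengths(freqs):
        """Return {symbol: code length in bits} for a frequency table."""
        # Each heap entry: (weight, tie_breaker, [symbols in this subtree]).
        heap = [(weight, i, [sym]) for i, (sym, weight) in enumerate(freqs.items())]
        heapq.heapify(heap)
        lengths = {sym: 0 for sym in freqs}
        if len(heap) == 1:                      # degenerate single-symbol alphabet
            return {next(iter(freqs)): 1}
        tie = len(heap)                         # unique tie-breakers keep tuple comparisons well-defined
        while len(heap) > 1:
            w1, _, syms1 = heapq.heappop(heap)
            w2, _, syms2 = heapq.heappop(heap)
            for sym in syms1 + syms2:           # every symbol in the merged subtree
                lengths[sym] += 1               # moves one level deeper in the tree
            heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
            tie += 1
        return lengths

    # "HuffWord"-style use: the symbols are whole words rather than bytes.
    words = "the cat sat on the mat the cat".split()
    print(huffman_code_lengths(Counter(words)))

The same function accepts a Counter built over bytes, 16-bit chunks, or anything else hashable; the choice of symbol alphabet is entirely up to the caller.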

On a 16-bit or 32-bit architecture, ASCII text files, UTF-8 text files, and photographs are stored much the same as they are on 8-bit computers, so there's no real reason to switch to a different symbol size for that kind of data.

On a 16-bit or 32-bit architecture, machine code is typically 16-bit aligned, so static Huffman coding with 16-bit symbols may make sense for compressing executables.
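
As a hedged illustration (the helper name symbol_counts is mine, not from any tool), treating an instruction stream as 16-bit symbols is only a matter of how the input is chunked before frequencies are counted:

    from collections import Counter

    def symbol_counts(data: bytes, symbol_bits: int = 16) -> Counter:
        """Count fixed-width symbols (8, 16, 32 bits, ...) in a byte string."""
        step = symbol_bits // 8
        return Counter(data[i:i + step] for i in range(0, len(data) - step + 1, step))

    # Illustrative 16-bit-aligned "machine code" (just a repeated byte pattern for the demo).
    code = bytes.fromhex("4889e5c3") * 4
    print(symbol_counts(code, 16).most_common())
    # -> [(b'H\x89', 4), (b'\xe5\xc3', 4)]

Feed those counts to the same Huffman construction as above; the symbol width only changes the alphabet size, not the algorithm.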

Static Huffman coding has the overhead of transmitting the bit length of each symbol's code, so that the receiver can reconstruct the codewords needed to decompress. The 257 or so bit lengths in the header of an 8-bit static Huffman stream are already too much overhead for "short string compression". As sascha pointed out, using 16 bits per "character" would require far more header overhead (65,000 or so bit lengths), so static Huffman coding with 16-bit symbols only makes sense for long files where that overhead is less significant.
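
A rough back-of-the-envelope sketch of that trade-off (the 4-bits-per-length figure is my own simplifying assumption, not taken from any format specification):

    def header_overhead_bytes(alphabet_size, bits_per_length=4):
        """Crude floor on header size: one small code-length field per symbol."""
        return alphabet_size * bits_per_length // 8

    for name, size in [("8-bit symbols ", 256 + 1),       # 256 byte values + end marker
                       ("16-bit symbols", 65536 + 1)]:    # 65536 values + end marker
        print(f"{name}: at least ~{header_overhead_bytes(size)} bytes of header")

    # 8-bit symbols : ~128 bytes   -- already too much for very short strings
    # 16-bit symbols: ~32768 bytes -- only pays off on long files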

David Cary