6

Is there an easy generalization of Huffman coding trees to situations where the output alphabet is not binary? For instance, if I wanted to compress some text by writing it out in ternary, I could still build up a prefix-free coding system for each character as I was writing the text out. Would the straightforward generalization of the Huffman construction (using a k-ary tree rather than a binary tree) still work correctly and efficiently? Or does this construction lead to a highly inefficient coding scheme?

GEOCHET
templatetypedef
  • Obvious approach is to try it out on some data with 3-ary and 4-ary trees, and compare compression to the standard huffman encoding and the entropy of the data. I actually kinda expect it to be a better approximation of the entropy than standard huffman, but that's just a guess. – Null Set Mar 27 '11 at 21:27
  • Probably in that case the end nodes of the tree will have 3 leaves instead of 2, and everything else will stay the same. – Alexei Polkhanov Mar 27 '11 at 21:29
  • To whoever downvoted - can you please explain what I can do to improve this question? – templatetypedef Mar 27 '11 at 21:32
  • It was not me, but I suspect that people get frustrated when someone asks a question to which a simple Google query would return plenty of answers, and which has been asked many times before on Stack Overflow. – Alexei Polkhanov Mar 27 '11 at 21:46
  • @Alexei: certainly I get frustrated if people downvote questions for being duplicated, but don't vote them for closure as duplicates. – Steve Jessop Mar 27 '11 at 23:07
  • Closing it as a duplicate makes sense, but how do I do that? – Alexei Polkhanov Mar 28 '11 at 05:23
  • @Alexei: Unfortunately in your case, step 1 is "get 3000 reputation". Step 2 is "find a question that it's a dupe of", so if you provide a link to that, then others who have the rep can vote this question closed. – Steve Jessop Mar 28 '11 at 09:29
  • 2
    Funnily enough, three years after this conversation started, the first result I get for "non-binary Huffman" in a Google search is *this answer* on Stack Overflow. Quite often downvoters react in a knee-jerk way, and that's frustrating. – nightcod3r Jan 27 '15 at 09:27

2 Answers

7

The algorithm still works and it's still simple; in fact, Wikipedia has a brief section on n-ary Huffman coding, citing Huffman's original paper as the source.

It does occur to me, though, that just as binary Huffman is slightly suboptimal because it allocates an integer number of bits to each symbol (unlike, e.g., arithmetic coding), ternary Huffman should be slightly more suboptimal still, because it has to allocate an integer number of trits. That's not a show-stopper, especially for n = 3, but it does suggest that n-ary Huffman falls further behind other coding algorithms as you increase n.
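As a hypothetical sketch (not from the original answer), the k-ary construction can be implemented by repeatedly merging the k lowest-weight subtrees. The one wrinkle the binary case hides is that you should first pad the symbol set with zero-weight dummies so that (n - 1) mod (k - 1) == 0; otherwise the merge at the root comes up short and the code is suboptimal.

```python
import heapq

def huffman_code_lengths(freqs, k=2):
    """Return {symbol: code length in k-ary digits} for a k-ary Huffman code.
    freqs maps symbols to positive weights."""
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    # Pad with zero-weight dummy symbols so that (n - 1) % (k - 1) == 0;
    # without this, the final merge is short-handed and codewords are wasted.
    while (len(heap) - 1) % (k - 1) != 0:
        heap.append((0, len(heap), {}))  # dummy: zero weight, no symbols
    heapq.heapify(heap)
    counter = len(heap)  # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        total, merged = 0, {}
        for _ in range(k):  # merge the k lightest subtrees
            w, _, depths = heapq.heappop(heap)
            total += w
            for sym, d in depths.items():
                merged[sym] = d + 1  # every symbol moves one level deeper
        heapq.heappush(heap, (total, counter, merged))
        counter += 1
    return heap[0][2]
```

For example, `huffman_code_lengths({'a': 5, 'b': 2, 'c': 1, 'd': 1}, k=3)` adds one dummy to make the merges come out even, then assigns one trit each to `a` and `b` and two trits each to `c` and `d`.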

hobbs
  • I don't quite understand why you say, _"... n-ary Huffman will fall further behind other coding algorithms as you increase n"_. If you could elaborate on why, that'd be great! – Anish Ramaswamy Sep 18 '14 at 21:47
  • @AnishRamaswamy Huffman coding is slightly suboptimal (when compared to e.g. range or arithmetic coding) because every codeword is an integer number of symbols (bits). That number of bits is never less than the ideal entropy of that codeword, but sometimes more. For ternary or higher n, the amount of information represented by each symbol goes up as n goes up, so the waste from "rounding up" to a whole number of symbols would also tend to increase. – hobbs Sep 19 '14 at 04:08
4

As an empirical test, I constructed binary and ternary Huffman trees for the distribution of Scrabble tiles.

The entropy of the distribution shows that you can't do better than 4.37 bits per letter on average.

The binary Huffman tree uses on average 4.41 bits per letter.

The ternary Huffman tree uses on average 2.81 trits per letter; at log2(3) ≈ 1.585 bits per trit, that is the same information density as about 4.45 bits per letter.
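These figures can be reproduced with a short sketch, assuming the distribution is the standard 100-tile English Scrabble set with blanks included (the answer doesn't publish its data, so that distribution is an assumption). Only the weights are needed: the average code length equals the sum of all merge totals divided by the total weight, because each merge pushes every symbol in its subtrees one level deeper.

```python
import heapq
import math

# Standard 100-tile English Scrabble distribution (assumed); '_' is the blank.
TILES = {'E': 12, 'A': 9, 'I': 9, 'O': 8, 'N': 6, 'R': 6, 'T': 6,
         'L': 4, 'S': 4, 'U': 4, 'D': 4, 'G': 3,
         'B': 2, 'C': 2, 'M': 2, 'P': 2, 'F': 2, 'H': 2, 'V': 2,
         'W': 2, 'Y': 2, '_': 2, 'K': 1, 'J': 1, 'X': 1, 'Q': 1, 'Z': 1}

def entropy_bits(freqs):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    total = sum(freqs.values())
    return -sum(w / total * math.log2(w / total) for w in freqs.values())

def avg_code_length(freqs, k):
    """Average k-ary Huffman code length, in k-ary digits per symbol."""
    heap = list(freqs.values())
    while (len(heap) - 1) % (k - 1) != 0:
        heap.append(0)  # zero-weight dummies so the merges come out even
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        merged = sum(heapq.heappop(heap) for _ in range(k))
        cost += merged  # symbols in the merged subtrees sink one level
        heapq.heappush(heap, merged)
    return cost / sum(freqs.values())

print(round(entropy_bits(TILES), 2))        # 4.37 bits per letter
print(round(avg_code_length(TILES, 2), 2))  # 4.41 bits per letter
print(round(avg_code_length(TILES, 3), 2))  # 2.81 trits per letter
print(round(avg_code_length(TILES, 3) * math.log2(3), 2))  # 4.45 bits-equivalent
```

Under that assumed distribution, the output matches the numbers quoted above.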

Null Set