2

I am studying Huffman coding for bit-encoding a stream of characters, and I read that an optimal code is represented by a full binary tree in which each distinct character is a leaf and every internal node has exactly two children.

I want to know why a full binary tree is the optimal choice here. In other words, what is the advantage of a full binary tree?

Geek
  • You'd probably want to read [*this*](http://xlinux.nist.gov/dads/HTML/optimalMerge.html) – Nir Alfasi Sep 17 '12 at 08:01
  • Where did you read this? – Deestan Sep 17 '12 at 08:06
  • 1
    @deestan Greedy algorithms chapter in [Introduction to algorithms](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/) – Geek Sep 17 '12 at 08:08

3 Answers

3

This is not a choice, but rather equivalence.

Optimal Huffman codes are decoded by a finite state machine, in which

  • each state has exactly two exits (the next bit being 0 or 1)
  • each state has exactly one entry
  • all states containing output symbols are stop states, and
  • all stop states contain output symbols

This is equivalent to a search tree where

  • all internal nodes have exactly two children
  • all nodes have exactly one parent
  • all nodes containing output symbols are leaf nodes, and
  • all leaf nodes contain output symbols

There are non-optimal Huffman codes as well, which have stop states / leaf nodes that do not contain output symbols. Such a binary tree would not be full.
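The state machine described above can be sketched as a walk over a full binary tree. This is only an illustration under assumptions: the nested-tuple representation, the name `decode`, and the four-symbol example code are made up here, and the sketch assumes at least two symbols.

```python
# A leaf is a one-character string (a stop state holding an output symbol).
# An internal node is a pair (child_for_bit_0, child_for_bit_1).
def decode(tree, bits):
    out = []
    node = tree                      # start state = root
    for bit in bits:
        node = node[int(bit)]        # exactly two exits per state
        if isinstance(node, str):    # reached a leaf: emit and restart
            out.append(node)
            node = tree
    if node is not tree:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)

# Hypothetical code: a=0, b=100, c=101, d=11
TREE = ("a", (("b", "c"), "d"))
print(decode(TREE, "0100101110"))    # abcda
```

Because every internal node has two children and every leaf holds a symbol, every bit string either ends at a symbol boundary or is an incomplete codeword; no bit pattern is wasted.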

Simon Richter
  • What do you mean by "each state has exactly one entry"? Also, can you provide a picture that shows all four requirements for decoding a Huffman code being satisfied by a full binary tree? – Geek Sep 17 '12 at 08:26
  • For each node (except the start), there is only one edge leading to it (i.e. there are no two input symbol sequences leading to the same state). – Simon Richter Sep 17 '12 at 08:47
  • Binary trees always fulfill the first two ("binary" -> two children per internal node, "tree" -> cycle free). – Simon Richter Sep 17 '12 at 08:54
  • The third follows from the second. If an internal node had an output symbol, then you would either need a third input symbol to terminate there (e.g. the pause in Morse code), or you could not output any symbol from a later state without also generating the earlier one (which is equivalent to a tree that generates two output symbols in the stop state). – Simon Richter Sep 17 '12 at 09:03
  • The fourth is not a requirement for Huffman code -- it is what makes a Huffman code "optimal". In an optimal code, bitstrings are either incomplete or generate an output symbol. – Simon Richter Sep 17 '12 at 09:07
3

Proof by contradiction:

Suppose that a tree T provides an optimal prefix code for the given characters and their frequencies, but T is not a full binary tree. Since T is not full, there exists a node N which has only one child C.

Let us construct a new binary tree T' by replacing N with C. The depth of every leaf in the subtree rooted at C is reduced by 1 in T' compared to T, and no other leaf moves, so the total encoded length strictly decreases. T' is therefore a better solution than T, which contradicts the assumption that T is optimal.

  T                T'

  /\              /\
 .  N            .  C
.  /            .
. C             .
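The replacement of N by C can be checked numerically. The following is a small sketch, assuming a made-up tuple representation (a leaf is a string, an internal node is a tuple of one or two children) and made-up frequencies; `cost` and `splice` are names chosen for this illustration.

```python
def cost(tree, freq, depth=0):
    """Total encoded length: sum of frequency * leaf depth."""
    if isinstance(tree, str):
        return freq[tree] * depth
    return sum(cost(child, freq, depth + 1) for child in tree)

def splice(tree):
    """Replace every one-child node N by its only child C."""
    if isinstance(tree, str):
        return tree
    children = tuple(splice(child) for child in tree)
    return children[0] if len(children) == 1 else children

FREQ = {"a": 5, "b": 2, "c": 1}
T = ("a", (("b", "c"),))          # the right child of the root is N, with one child C
T_PRIME = splice(T)               # ("a", ("b", "c"))

print(cost(T, FREQ), cost(T_PRIME, FREQ))   # 14 11
```

Every leaf below C moves up one level and nothing else changes, so the cost can only drop, which is the step the proof relies on.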
  • You should also say that the tree T' you get by replacing N by C yields a tree that still represents a prefix code because no internal node corresponds to a character (since this was true in T before replacing N by C) – tail_recursion Oct 02 '22 at 04:30
0

You asked why a full binary tree. That is actually three questions.

If you're asking about "full", then it must be full for any correctly generated Huffman code.
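One way to see this is that the usual greedy construction only ever creates a node by merging two existing subtrees, so no one-child node can appear. The sketch below, with made-up frequencies and the hypothetical helper name `huffman_lengths`, tracks only code lengths and checks that the Kraft sum is exactly 1, which for a prefix code holds precisely when the tree is full.

```python
import heapq
from fractions import Fraction

def huffman_lengths(freq):
    # Heap entries are (weight, tiebreak, {symbol: depth}); the unique
    # tiebreak keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, d2 = heapq.heappop(heap)
        # Merging puts both subtrees under a new two-child node,
        # so every symbol in them gets one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

lengths = huffman_lengths({"a": 5, "b": 2, "c": 1, "d": 3})
kraft = sum(Fraction(1, 2 ** n) for n in lengths.values())
print(lengths, kraft)
```

For these frequencies the lengths come out as 1, 2, 3, 3 for a, d, b, c, and the Kraft sum is 1.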

If you're asking about "binary", every encountered bit in a Huffman code has two possibilities, 0 or 1, so each node must have two branches.

If, however, you're asking about "tree", you do not need to represent the code as a tree at all. There are many representations that not only describe the code completely, but also allow both a shorter description in the compressed stream and faster decoding than a tree would.

One example is to use a canonical Huffman code and represent it simply as the counts of symbols at each bit length plus a list of the corresponding symbols. This is used in the puff.c code. Another is to generate a set of tables that decode several bits at a time in stages, which is used in zlib's inflate. There are others.
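The counts-plus-symbols representation can be decoded without building any tree. The following is a rough sketch of that idea, not a transcription of puff.c: the function name `canonical_decode`, the string-of-bits input, and the example code (a=0, d=10, b=110, c=111) are assumptions made for illustration.

```python
def canonical_decode(counts, symbols, bits):
    """counts[n] is the number of codes of length n (counts[0] is unused);
    symbols lists the symbols ordered by code length, then canonical order."""
    out = []
    stream = iter(bits)
    while True:
        code = first = index = 0
        for length in range(1, len(counts)):
            bit = next(stream, None)
            if bit is None:
                if length == 1:
                    return "".join(out)      # clean end of input
                raise ValueError("input ends inside a codeword")
            code |= int(bit)
            count = counts[length]
            if code - count < first:         # code falls within this length
                out.append(symbols[index + (code - first)])
                break
            index += count                   # skip symbols of this length
            first = (first + count) << 1     # first code of the next length
            code <<= 1
        else:
            raise ValueError("bits do not match any code")

COUNTS = [0, 1, 1, 2]                        # one 1-bit, one 2-bit, two 3-bit codes
SYMBOLS = ["a", "d", "b", "c"]
print(canonical_decode(COUNTS, SYMBOLS, "0110111100"))   # abcda
```

The whole code is described by two short lists, which is why this form is compact to transmit.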

Mark Adler