You almost certainly do not need to store the tree itself. You could do, and it shouldn't take the space you think it does, but it's not generally necessary.
If your Huffman codes are canonical, you need only store the bit-length for each symbol, as this is all the information required to regenerate a canonical coding. This is a relatively small number of bits per symbol, so it should be fairly compact. You can also compress that information further (see the answer from Aki Suihkonen).
Naturally the bit-length of a code is essentially the same as the tree depth, so I think this is roughly what you're asking about. The important part is to know how to build a canonical code, given the lengths - it's not necessarily the same as the codes produced by traversing the tree. You could regenerate a tree from this, but it's not necessarily the tree you started with - however typically you don't need the tree other than to determine the code lengths in the first place.
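As a rough illustration of "you only need the tree to get the lengths", here is a minimal sketch that computes code lengths from symbol counts without keeping a tree around at all. The function name `huffman_code_lengths` and the tie-breaking order are my own choices for the example, not part of any standard; a different tie-break can give different (but equally optimal) lengths.

```python
import heapq
from collections import Counter

def huffman_code_lengths(text):
    """Return {symbol: code length} for the symbols in `text`."""
    counts = Counter(text)
    # Heap entries are (weight, unique id, symbols under this node).
    # The unique id stops Python from ever comparing the symbol lists.
    heap = [(count, i, [sym])
            for i, (sym, count) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        # Merging two nodes pushes every symbol beneath them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next_id, syms1 + syms2))
        next_id += 1
    # Note: a text with a single distinct symbol gets length 0 here;
    # real formats usually special-case that.
    return lengths

lengths = huffman_code_lengths("banana")
print(lengths)  # a has length 1; b and n have length 2
```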
The algorithm for generating canonical codes is fairly simple:
1. Take all the symbols you want to generate codes for, sorted first by code-length (shortest first), and then by the symbol itself.
2. Start with a zero-length code.
3. If the next symbol requires more bits than are currently in the code, add zeros to the right (least significant bits) of your code until it's the right length.
4. Associate the code with the current symbol, and increment the code.
5. Loop back to (3) until you have generated all the symbols.
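The steps above can be sketched in a few lines of Python. This is only a sketch under some assumptions of mine: lengths arrive as a dict, codes are returned as bit strings for readability, and a length of 0 means "symbol not used". The name `canonical_codes` is hypothetical.

```python
def canonical_codes(lengths):
    """Map each used symbol to its canonical code, given {symbol: length}."""
    used = [(sym, n) for sym, n in lengths.items() if n > 0]
    # Sort by code length (shortest first), then by the symbol itself.
    used.sort(key=lambda item: (item[1], item[0]))

    codes = {}
    code = 0       # current code value
    code_len = 0   # number of bits currently in the code
    for sym, n in used:
        # Add zeros on the right until the code is the required length.
        code <<= n - code_len
        code_len = n
        codes[sym] = format(code, "0{}b".format(n))
        code += 1
    return codes

print(canonical_codes({"a": 1, "b": 2, "n": 2}))
# {'a': '0', 'b': '10', 'n': '11'}
```

A real encoder would keep the integer code and its length rather than a string, but the logic is the same.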
Take the string "banana". Obviously there are 3 symbols used, 'b', 'a', and 'n', with counts of 1, 3, and 2, respectively.
So the tree might look like this:
        *
       / \
      *   a
     / \
    b   n
Naively, that could give codes:
a = 1
b = 00
n = 01
However if instead you simply use the bit-lengths as input to canonical code generation, you would produce this:
a = 0
b = 10
n = 11
It's a different code, but it produces compressed output of exactly the same length, because every symbol keeps the same bit-length. Furthermore, you only need to store the code-lengths in order to reproduce the code.
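A quick check of that claim, using the two code tables listed above (the `encode` helper is just something I wrote for the example):

```python
naive = {"a": "1", "b": "00", "n": "01"}
canonical = {"a": "0", "b": "10", "n": "11"}

def encode(text, codes):
    """Concatenate the code of every character in `text`."""
    return "".join(codes[ch] for ch in text)

naive_bits = encode("banana", naive)          # '001011011'
canonical_bits = encode("banana", canonical)  # '100110110'
print(len(naive_bits), len(canonical_bits))   # 9 9
```

The bits differ, but both encodings take 9 bits, since each symbol has the same length in both tables.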
So you only need to store a sequence:
0... 1 2 0... 2 0...
Where "..." represents easily compressible repetition, and the values will all be quite small (probably only 4 bits each). Note that the symbols aren't stored at all: the position of each length in the sequence tells you which symbol it belongs to. This representation will be very compact.
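To make the storage concrete, here is a minimal sketch of packing such a table at 4 bits per entry. It assumes a 256-entry table indexed by byte value and a maximum code length of 15; both are assumptions for the example (real formats pick their own limits), and `pack_lengths` / `unpack_lengths` are hypothetical names. It does not do the further run-length compression mentioned above.

```python
def pack_lengths(lengths):
    """Pack code lengths (each 0..15) into bytes, two lengths per byte."""
    if any(not 0 <= n <= 15 for n in lengths):
        raise ValueError("code lengths must fit in 4 bits")
    padded = list(lengths) + [0] * (len(lengths) % 2)
    return bytes((padded[i] << 4) | padded[i + 1]
                 for i in range(0, len(padded), 2))

def unpack_lengths(data, count):
    """Inverse of pack_lengths; `count` is the number of symbols."""
    lengths = []
    for byte in data:
        lengths.append(byte >> 4)
        lengths.append(byte & 0x0F)
    return lengths[:count]

# One entry per possible byte value; unused symbols get length 0.
table = [0] * 256
for symbol, length in {"a": 1, "b": 2, "n": 2}.items():
    table[ord(symbol)] = length

packed = pack_lengths(table)
print(len(packed))  # 128
```

Even without any further compression that is 128 bytes for a full byte alphabet, and the long runs of zeros shrink a lot under run-length coding.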
If you really must store the tree itself, one technique is to traverse the tree and store a single bit to indicate whether a node is internal or a leaf, and then, for leaf nodes, store the symbol. This is fairly compact for trees which do not contain every symbol, and not too bad even for fairly complete trees. The worst case size for this would be the total size of all your symbols, plus one bit for every node you could have. For a standard 8-bit byte stream there can be 256 leaves and 255 internal nodes, so that would be 320 bytes (256 bytes for the symbols, plus 511 bits, i.e. 64 bytes, for the tree structure itself).
The method is to start at the root node, and for each node:
- If the node is a parent, output a 0 and then output the left then right children.
- If the node is a leaf, output a 1 and then output the symbol.
To reconstruct, perform a similar recursive procedure, but obviously reading the data and choosing whether to recursively create children, or read in a symbol, as appropriate.
For the example above, the bit-stream for the tree would be something like:
0, 0, 1, 'b', 1, 'n', 1, 'a'
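That's 5 bits for the tree, plus 3 bytes for the symbols, rounding up to 4 bytes of storage. However, every extra symbol adds a full byte for the symbol plus two more flag bits (one new leaf and one new internal node), so it grows much faster than the code-lengths, which need only a few bits per symbol.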
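Here is a minimal sketch of that traversal and its inverse. To keep it readable it makes two simplifications of my own: the tree is a nested tuple `(left, right)` with bare symbols as leaves, and the output is a list of tokens rather than packed bits. The names `write_tree` and `read_tree` are hypothetical.

```python
def write_tree(node, out):
    """Append the pre-order encoding of `node` to the list `out`."""
    if isinstance(node, tuple):   # internal node: (left, right)
        out.append(0)
        write_tree(node[0], out)
        write_tree(node[1], out)
    else:                         # leaf: flag, then the symbol itself
        out.append(1)
        out.append(node)
    return out

def read_tree(tokens):
    """Rebuild the tree from an iterator over write_tree's tokens."""
    if next(tokens) == 0:
        left = read_tree(tokens)
        right = read_tree(tokens)
        return (left, right)
    return next(tokens)           # the token after a 1 flag is the symbol

tree = (("b", "n"), "a")
stream = write_tree(tree, [])
print(stream)                     # [0, 0, 1, 'b', 1, 'n', 1, 'a']
restored = read_tree(iter(stream))
print(restored == tree)           # True
```

In a real file you would write each flag as a single bit and each symbol as 8 bits through some bit-writer, but the recursion is exactly this.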