
A (memoryless) source generates symbols a1, a2, ..., an with relative frequencies f1, f2, ..., fn. Assume that the frequencies are all positive.

An optimal prefix-free code for this source can be constructed with the Huffman algorithm (in most of its variants), and possibly with other algorithms.

While there are many different optimal prefix-free codes for this source, they all have the same expected codeword length:

(f1 * l1 + f2 * l2 + ... + fn * ln) / (f1 + f2 + ... + fn)

Here, li denotes the length, in bits, of the codeword assigned to symbol ai.

Call two such frequency lists equivalent if the expected codeword lengths of the optimal prefix-free codes that they determine are equal.
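To make the definitions above concrete, here is a minimal Python sketch that computes the codeword lengths of one optimal code, the expected codeword length, and the equivalence test. The names `huffman_lengths`, `expected_length` and `equivalent` are my own, and the use of exact `Fraction` arithmetic is an assumption made so that the equality test is not disturbed by floating-point rounding (decimal inputs should be passed as strings such as "0.14").

```python
import heapq
from fractions import Fraction


def huffman_lengths(freqs):
    """Codeword lengths l_i of one optimal prefix-free code (needs >= 2 symbols)."""
    # Heap entries: (weight, unique tie-breaker, indices of symbols in this subtree).
    heap = [(Fraction(f), i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    next_id = len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # every symbol under the new node ends up one bit deeper
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths


def expected_length(freqs):
    """(f1*l1 + ... + fn*ln) / (f1 + ... + fn) for an optimal code."""
    fs = [Fraction(f) for f in freqs]
    ls = huffman_lengths(fs)
    return sum(f * l for f, l in zip(fs, ls)) / sum(fs)


def equivalent(a, b):
    """True if the two frequency lists have the same optimal expected length."""
    return expected_length(a) == expected_length(b)
```

For instance, `huffman_lengths([1, 2, 3, 4])` gives the lengths 3, 3, 2, 1, so the expected length is 19/10, while `equivalent([3, 4, 5, 6], [1, 1, 1, 1])` holds because both lists give exactly 2 bits per symbol.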

Problem

Describe an algorithm that, given a list of strictly positive frequencies, produces an equivalent list consisting only of positive integers whose largest element is as small as possible.
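This is not the algorithm being asked for, but a brute-force baseline may help when testing candidate algorithms on tiny inputs. The sketch below assumes rational input frequencies (integers, `Fraction`s, or decimal strings), tries each candidate maximum m = 1, 2, ... in turn, and returns the first list of positive integers of the same length whose optimal expected codeword length matches. The names `brute_force_minimal` and `max_bound` are hypothetical, and the running time grows exponentially with the number of symbols.

```python
import heapq
from fractions import Fraction
from itertools import combinations_with_replacement


def expected_length(freqs):
    """Optimal expected codeword length, computed as the sum of the
    merged weights in the Huffman procedure divided by the total weight."""
    heap = [Fraction(f) for f in freqs]
    total = sum(heap)
    heapq.heapify(heap)
    cost = Fraction(0)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        cost += merged
        heapq.heappush(heap, merged)
    return cost / total


def brute_force_minimal(freqs, max_bound=30):
    """Exhaustive search; returns a sorted tuple, or None if nothing is
    found with largest element <= max_bound (an arbitrary cut-off)."""
    target = expected_length(freqs)
    n = len(freqs)
    for m in range(1, max_bound + 1):
        # Fix the largest entry at m; the other n - 1 entries range over 1..m.
        for rest in combinations_with_replacement(range(1, m + 1), n - 1):
            candidate = rest + (m,)
            if expected_length(candidate) == target:
                return candidate
    return None
```

The expected length does not depend on the order of the frequencies, so the result is returned in sorted order. With the definition of equivalence used above, `brute_force_minimal([3, 4, 5, 6])` returns `(1, 1, 1, 1)`.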

Partial Answer

In the hopes of encouraging progress towards a full solution, I'll post a sketch of a partial solution that I thought of but couldn't complete.

Any full binary tree with n leaves determines a prefix code for n symbols. If the leaves are labelled with the given frequencies, and the internal nodes are labelled with the sum of the frequencies of their children, then the sum of the labels of the internal nodes (including the root), divided by the root label, gives the expected word length of the code. This works because each leaf label is counted once in every one of its ancestors, that is, once per bit of its codeword.
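As a small check of this identity, the following sketch records the internal node labels created while building a Huffman tree. The function name `internal_labels` is my own, and `Fraction` is used only to keep the arithmetic exact.

```python
import heapq
from fractions import Fraction


def internal_labels(freqs):
    """Labels of the internal nodes of a Huffman tree, in creation order.
    The last label is the root, i.e. the sum of all frequencies."""
    heap = [Fraction(f) for f in freqs]
    heapq.heapify(heap)
    labels = []
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        labels.append(merged)  # label of the new internal node
        heapq.heappush(heap, merged)
    return labels


labels = internal_labels([1, 2, 3, 4])
expected = sum(labels) / labels[-1]  # internal node sum divided by the root label
```

For the frequencies 1, 2, 3, 4 the internal labels are 3, 6 and 10, so the ratio is 19/10. The codeword lengths of that tree are 3, 3, 2, 1, and (1*3 + 2*3 + 3*2 + 4*1) / 10 is also 19/10.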

If the code was constructed using the Huffman algorithm, then the expected word length is minimal (among prefix codes).

Start by constructing a Huffman tree for the given frequencies.

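If we modify the leaf labels in such a way that the sum of the leaf labels is preserved and the sum of the internal node labels is preserved, then the expected word length of the tree is unchanged. The tree still determines an optimal prefix code provided that it remains optimal for the new labels, which has to be checked separately.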

For example, for any two leaves at the same level, increasing one leaf label and decreasing the other by the same amount preserves both sums.

If the two leaves don't have the same level then the situation is a little more complicated, but modifications can still be made to the leaf labels that preserve the invariants.
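To illustrate, for a fixed tree with leaf depths d_i the two invariants are the sum of the labels f_i and the sum of the products f_i * d_i. The sketch below shows the same-level shift and one possible modification for leaves at different levels: for three leaves a, b, c, changing the labels by t*(d_b - d_c), t*(d_c - d_a) and t*(d_a - d_b) leaves both sums unchanged. The function names are my own. Note that shifting weight between only two leaves at different depths cannot preserve both sums, and that these helpers do not check that the labels stay positive or that the tree is still optimal for the new labels.

```python
def invariants(labels, depths):
    """(sum of leaf labels, sum of internal node labels) for a fixed tree.
    The second value equals sum(label * depth) over the leaves."""
    return sum(labels), sum(f * d for f, d in zip(labels, depths))


def shift_same_level(labels, depths, i, j, delta):
    """Move delta from leaf j to leaf i; both leaves must be at the same depth."""
    if depths[i] != depths[j]:
        raise ValueError("leaves must be at the same level")
    out = list(labels)
    out[i] += delta
    out[j] -= delta
    return out


def shift_three_leaves(labels, depths, a, b, c, t):
    """Adjust three leaves (any depths) so that both invariants are preserved."""
    out = list(labels)
    out[a] += t * (depths[b] - depths[c])
    out[b] += t * (depths[c] - depths[a])
    out[c] += t * (depths[a] - depths[b])
    return out


labels, depths = [1, 2, 3, 4], [3, 3, 2, 1]
same = shift_same_level(labels, depths, 0, 1, 1)        # [2, 1, 3, 4]
three = shift_three_leaves(labels, depths, 1, 2, 3, 1)  # [1, 3, 1, 5]
```

All three label lists have invariants (10, 19) on this tree. The list 1, 3, 1, 5 also shows why a separate optimality check is needed: a Huffman tree built for those labels has internal node sum 17, so the original tree shape is no longer optimal for them.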

It remains only to:

  1. define a set of modifications rich enough to always suffice to "optimize" the tree (i.e., to minimize the maximum leaf label).
  2. give an algorithm for applying modifications from this list in the right order to get an optimal tree.

The idea is similar to applying a series of elementary row operations to an augmented matrix to transform it to reduced row echelon form, while preserving the solution set of the original linear system at each step.

sitiposit
  • A correctly implemented Huffman algorithm would discard a qi of zero and the associated symbol, so your example of "0,1,2,3" would not in fact code four symbols. It would code three symbols. Your example solution is incorrect. – Mark Adler Feb 22 '22 at 02:33
  • Your other example answer is wrong as well. The tree produced by 0.14, 0.25, 0.3, 0.31 is simply two branches each with two branches, so the codes are all two bits in length. This can be reproduced by the frequencies 1, 1, 1, 1, which has a smaller max frequency than 3, 4, 5, 6. – Mark Adler Feb 22 '22 at 02:42
  • Similarly, the tree that results from 0.1, 0.2, 0.3, 0.4 is reproduced by the frequencies 1, 1, 2, 3. – Mark Adler Feb 22 '22 at 02:44
  • @MarkAdler Your corrections are appreciated. On reconsideration, I realize that my post is not as clear or unambiguous (or correct) as I intended. I will carefully compose a clearer version. You obviously understood my intent and I hope that you will be able to contribute insight towards a solution once I get it posted. – sitiposit Feb 22 '22 at 13:30
  • I made extensive changes to clarify the problem description and to correct the errors pointed out in the comments. – sitiposit Feb 22 '22 at 17:15
  • Now that I think about it, the second last code listed can't possibly be right. It seems to have a higher average codeword length than the final code. Perhaps I am not properly understanding the Huffman algorithm. – sitiposit Feb 22 '22 at 17:22
  • You need to compute a weighted average using the frequencies. Both codes give a weighted average of exactly two bits per symbol. – Mark Adler Feb 22 '22 at 18:44
