A (memoryless) source generates symbols a1, a2, ..., an with relative frequencies f1, f2, ..., fn. Assume that the frequencies are all positive.
An optimal prefix-free code for this source can be constructed with the Huffman algorithm, or with any other algorithm that minimizes the expected codeword length.
While there are many different optimal prefix-free codes for this source, they all have the same expected codeword length:
(f1 * l1 + f2 * l2 + ... + fn * ln) / (f1 + f2 + ... + fn)
Here, li denotes the length, in bits, of the codeword assigned to symbol ai.
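To make the formula concrete, here is a minimal sketch that builds one Huffman code and evaluates the expected codeword length exactly. The function names `huffman_lengths` and `expected_length` are my own, and the sketch assumes at least two symbols; ties are broken arbitrarily, which can change the individual lengths but not the expected length.

```python
import heapq
from fractions import Fraction

def huffman_lengths(freqs):
    """Return the codeword lengths l_i of one Huffman code for freqs."""
    # Heap entries: (weight, unique tiebreak, indices of the leaves below this node).
    heap = [(Fraction(f), i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    tiebreak = len(freqs)
    while len(heap) > 1:
        w1, _, leaves1 = heapq.heappop(heap)
        w2, _, leaves2 = heapq.heappop(heap)
        # Every leaf below the merged node moves one level deeper.
        for i in leaves1 + leaves2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, leaves1 + leaves2))
        tiebreak += 1
    return lengths

def expected_length(freqs):
    """(f1*l1 + ... + fn*ln) / (f1 + ... + fn) for a Huffman code."""
    fs = [Fraction(f) for f in freqs]
    lengths = huffman_lengths(freqs)
    return sum(f * l for f, l in zip(fs, lengths)) / sum(fs)

print(huffman_lengths([1, 1, 2, 4]))   # [3, 3, 2, 1]
print(expected_length([1, 1, 2, 4]))   # 7/4
```

Exact rational arithmetic is used so that two expected lengths can later be compared for equality without floating-point noise.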
Call two such frequency lists equivalent if the expected codeword lengths of the optimal prefix-free codes that they determine are equal.
Problem
Describe an algorithm that, given a list of strictly positive frequencies, produces an equivalent list that consists only of positive integers and whose largest element is as small as possible.
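For reference, the problem can at least be solved by exhaustive search on tiny inputs. The sketch below is only a naive baseline, not the algorithm being asked for: it tries every candidate maximum m = 1, 2, ... and every list of positive integers with that maximum. It assumes the input frequencies are rational (integers, `Fraction`s, or floats that are exact binary fractions), and the names `optimal_expected_length`, `smallest_equivalent_integer_list` and the cutoff `max_bound` are hypothetical. The optimal expected length is computed as the sum of the weights created by the Huffman merges, divided by the total weight.

```python
import heapq
from fractions import Fraction
from itertools import combinations_with_replacement

def optimal_expected_length(freqs):
    """Expected codeword length of an optimal prefix-free code for freqs."""
    heap = [Fraction(f) for f in freqs]
    heapq.heapify(heap)
    total = sum(heap)
    merged_sum = Fraction(0)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        merged_sum += merged          # one term per internal node
        heapq.heappush(heap, merged)
    return merged_sum / total

def smallest_equivalent_integer_list(freqs, max_bound=50):
    """Brute force: first integer list (in order of increasing maximum)
    with the same optimal expected length, or None if none is found
    with maximum <= max_bound."""
    target = optimal_expected_length(freqs)
    n = len(freqs)
    for m in range(1, max_bound + 1):
        # Equivalence ignores order, so sorted lists whose last entry is m suffice.
        for rest in combinations_with_replacement(range(1, m + 1), n - 1):
            candidate = list(rest) + [m]
            if optimal_expected_length(candidate) == target:
                return candidate
    return None

print(smallest_equivalent_integer_list([0.5, 0.25, 0.125, 0.125]))  # [1, 1, 2, 4]
```

The returned list is sorted and is not matched to the order of the input symbols, which is harmless here because equivalence is defined through the expected length alone. The running time grows very quickly with both n and the size of the answer.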
Partial Answer
In the hopes of encouraging progress towards a full solution, I'll post a sketch of a partial solution that I thought of but couldn't complete.
Any full binary tree with n leaves determines a prefix code for n symbols. If the leaves are labelled with the given frequencies and each internal node is labelled with the sum of the labels of its children, then the sum of the labels of the internal nodes (including the root), divided by the root label, gives the expected word length of the code. This holds because each leaf label is counted once for every internal node above that leaf, that is, once per bit of its codeword.
If the code was constructed using the Huffman algorithm, then the expected word length is minimal (among prefix codes).
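The identity can be checked numerically. The following sketch builds a Huffman tree, labels the internal nodes, and compares the two ways of computing the expected word length. The tree representation (nested tuples) and the names `huffman_tree`, `leaf_depths` and `label_sums` are assumptions made for this illustration.

```python
import heapq
from fractions import Fraction

def huffman_tree(freqs):
    """Leaves are ('leaf', i); internal nodes are ('node', left, right)."""
    heap = [(Fraction(f), i, ("leaf", i)) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    tiebreak = len(freqs)  # unique, so the tuples' tree component is never compared
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, ("node", t1, t2)))
        tiebreak += 1
    return heap[0][2]

def leaf_depths(tree, depth=0, out=None):
    """Map each leaf index to its depth, i.e. its codeword length."""
    if out is None:
        out = {}
    if tree[0] == "leaf":
        out[tree[1]] = depth
    else:
        leaf_depths(tree[1], depth + 1, out)
        leaf_depths(tree[2], depth + 1, out)
    return out

def label_sums(tree, freqs):
    """Return (label of this subtree's root, sum of internal labels in it)."""
    if tree[0] == "leaf":
        return Fraction(freqs[tree[1]]), Fraction(0)
    left_label, left_sum = label_sums(tree[1], freqs)
    right_label, right_sum = label_sums(tree[2], freqs)
    label = left_label + right_label
    return label, left_sum + right_sum + label

freqs = [5, 9, 12, 13, 16, 45]
tree = huffman_tree(freqs)
depth = leaf_depths(tree)
root, internal = label_sums(tree, freqs)
by_definition = sum(Fraction(f) * depth[i] for i, f in enumerate(freqs)) / sum(freqs)
print(internal / root, by_definition)  # 56/25 56/25
```

For this input the internal labels are 14, 25, 30, 55 and 100, so the sum is 224 and the root label is 100.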
Start by constructing a Huffman tree for the given frequencies.
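If we modify the leaf labels in such a way that the sum of the leaf labels and the sum of the internal node labels are both preserved, then the tree gives the same expected word length for the new labels. As long as the tree is still optimal for the new labels (a large enough change can make a different tree cheaper), the new frequency list is equivalent to the original one.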
For example, for any two leaves at the same level, increasing one leaf label and decreasing the other by the same amount preserves both invariants.
If the two leaves are not at the same level, the situation is a little more complicated: adjusting just those two labels cannot preserve both sums, but modifications that involve three or more leaves can still preserve the invariants.
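Here is a small sketch of two such modifications, working directly with leaf depths (the sum of the internal node labels equals the sum of label times depth over the leaves). The function names and the particular three-leaf move are my own choices for illustration: for leaves a, b, c the changes t*(dc - db), t*(da - dc), t*(db - da) add up to zero, and so do their depth-weighted versions. The sketch also checks whether the fixed tree is still optimal afterwards, since preserving the two sums does not guarantee that by itself.

```python
import heapq

def invariants(freqs, depths):
    """(sum of leaf labels, sum of internal node labels) for a fixed tree shape."""
    return sum(freqs), sum(f * d for f, d in zip(freqs, depths))

def optimal_internal_sum(freqs):
    """Sum of internal node labels of a Huffman tree built for freqs."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

def shift_same_level(freqs, depths, i, j, t):
    """Move t units from leaf j to leaf i; both must be at the same level."""
    if depths[i] != depths[j]:
        raise ValueError("leaves must be at the same level")
    new = list(freqs)
    new[i] += t
    new[j] -= t
    return new

def shift_three_leaves(freqs, depths, a, b, c, t):
    """Adjust three leaves so that both sums stay unchanged."""
    new = list(freqs)
    new[a] += t * (depths[c] - depths[b])
    new[b] += t * (depths[a] - depths[c])
    new[c] += t * (depths[b] - depths[a])
    return new

freqs, depths = [4, 4, 8, 16], [3, 3, 2, 1]           # a Huffman tree for freqs
small = shift_three_leaves(freqs, depths, 0, 2, 3, 1)  # [3, 4, 10, 15]
large = shift_three_leaves(freqs, depths, 0, 2, 3, 3)  # [1, 4, 14, 13]
for labels in (freqs, small, large):
    print(labels, invariants(labels, depths), optimal_internal_sum(labels))
```

Both modified lists keep the invariants (32, 56), but only the first keeps the tree optimal: for [1, 4, 14, 13] a Huffman tree has internal sum 55, so the original tree shape is no longer optimal and the list is not equivalent to the original. Any set of allowed modifications therefore needs a bound on how far each one may be pushed.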
It remains only to:
- define a set of modifications rich enough that it always suffices to "optimize" the tree (i.e. to minimize the maximum leaf label).
- give an algorithm for applying modifications from this list in the right order to get an optimal tree.
The idea is similar to applying a series of elementary row operations to an augmented matrix in order to transform it to reduced row echelon form, while preserving the solution set of the original linear system at each step.