
I am trying to use Huffman coding to create an optimal coding for a set of symbols. However, a constraint is placed on the encodings such that no encoding contains the string "00".
For example, the encoding 'A' = '0' and 'B' = '10' would not satisfy the constraint because the string 'BA' encodes to '100', which contains the "00" substring.
This means that the code words themselves also cannot contain the string "00". For example, the encoding 'A' = '1', 'B' = '00', and 'C' = '01' would not satisfy the constraint because encoding 'B' would always result in "00" appearing in the output.

I have tried modifying the Huffman coding algorithm found on Wikipedia:

  1. Create a leaf node for each symbol and add it to the priority queue.
  2. While there is more than one node in the queue:
    1. Remove the two nodes of highest priority (lowest probability) from the queue
      • If neither node is a leaf node, select the highest-priority node and the highest-priority leaf node instead. This ensures that at least one of the selected nodes is a leaf node.
    2. Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities.
      • If one node is not a leaf node, make that node the right child of the new internal node (so it is reached by a '1' edge when encoding). This avoids creating the "00" substring.
    3. Add the new node to the queue.
  3. The remaining node is the root node and the tree is complete.
  4. Add a '1' to the beginning of all codes to avoid the "00" substring when two adjacent symbols are encoded.

There is also the case where the only two nodes left in the queue are both non-leaf nodes. I am not sure how to solve this problem. Otherwise, I believe this creates a coding that satisfies the constraint, but I am unsure if it is optimal, and I would like to be certain that it is.
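Here is a rough C sketch of the merge step (2.1 and 2.2) as described above, using a plain array as the priority queue. The names and the array-based queue are just for illustration, and the fallback for the two-internal-nodes case is a placeholder, since that is exactly the part I am unsure about:

#include <stdlib.h>

// Tree node: the left edge encodes '0', the right edge encodes '1'.
typedef struct node {
    double prob;
    int is_leaf;
    struct node *left, *right;
} node;

// Remove and return the lowest-probability node from the array q of size *n,
// optionally restricted to leaf nodes. Returns NULL if none qualifies.
static node *take_min(node **q, int *n, int leaves_only) {
    int best = -1;
    for (int i = 0; i < *n; i++)
        if ((!leaves_only || q[i]->is_leaf) &&
            (best < 0 || q[i]->prob < q[best]->prob))
            best = i;
    if (best < 0)
        return NULL;
    node *r = q[best];
    q[best] = q[--*n];              // swap-remove from the array
    return r;
}

// One merge step (steps 2.1 and 2.2); the caller ensures *n >= 2.
static node *merge_step(node **q, int *n) {
    node *a = take_min(q, n, 0);
    node *b = take_min(q, n, 0);
    if (!a->is_leaf && !b->is_leaf) {
        q[(*n)++] = b;              // put b back...
        b = take_min(q, n, 1);      // ...and take the best leaf instead
        if (b == NULL)              // no leaf left: the unresolved case
            b = take_min(q, n, 0);  // placeholder fallback
    }
    node *p = malloc(sizeof(node));
    p->prob = a->prob + b->prob;
    p->is_leaf = 0;
    // Step 2.2: an internal node goes on the right (the '1' side).
    if (!a->is_leaf) { p->right = a; p->left = b; }
    else             { p->right = b; p->left = a; }
    return p;
}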

  • A Huffman-like greedy algorithm is not possible for this problem, because it is not always the case that the two lowest-probability nodes in the optimal tree are actually siblings. (I added a counterexample in a comment to the accepted solution.) Out of curiosity, where did you encounter this problem? It seems to me too hard for a competitive programming site or undergraduate-level course; people have been slogging away on it for over sixty years and it's still not very well understood, as far as I can see. – rici Feb 17 '23 at 21:22

2 Answers


I think I'd start with the rule that any "0" in a code must be followed by a "1". That satisfies the constraint that codes are not allowed to contain "00". It also avoids the problem of a "00" substring being produced when two adjacent symbols are encoded.

The resulting code tree is shown below, where

  • the nodes in the red shaded areas are codes that contain "00"
  • the nodes containing a red X are codes that end with a "0"
  • the green nodes are the available valid codes

Note that because a Huffman code is a prefix-free code, choosing one of the valid codes eliminates all of the descendants of that node. For example, choosing to use the code "01" eliminates all of the other nodes on the left side of the tree. To put it another way, choosing "01" makes "01" a leaf, and breaks the two connections below "01".

Also note that the left child of an interior node will have a longer code than the right child, so the child with lower probability must be connected on the left. That's certainly necessary. It's left as an exercise to prove that it's sufficient. (If not sufficient, then the exercise is to come up with the optimal assignment algorithm.)

[Figure: code tree in which the red shaded areas contain "00", nodes marked with a red X end in "0", and the green nodes are the available valid codes]
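Concretely, the green codes are exactly the strings formed by concatenating the code units "1" and "01". A throwaway C sketch that enumerates them (the function and variable names are made up for illustration):

#include <stdio.h>

// Print every code of length <= maxlen built from the units "1" and "01",
// i.e., every code with no "00" inside it and no trailing "0".
static void emit(char *code, int len, int maxlen) {
    if (len > 0) {
        code[len] = '\0';
        printf("%s\n", code);
    }
    if (len + 1 <= maxlen) {        // append the unit "1"
        code[len] = '1';
        emit(code, len + 1, maxlen);
    }
    if (len + 2 <= maxlen) {        // append the unit "01"
        code[len] = '0';
        code[len + 1] = '1';
        emit(code, len + 2, maxlen);
    }
}

int main(void) {
    char code[16];
    emit(code, 0, 4);               // all valid codes of length <= 4
    return 0;
}

Each valid string is generated exactly once, because any string in which every "0" is immediately followed by a "1" breaks uniquely into these two units.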

user3386109
  • It's not optimal. Counter-example with five symbols (and frequencies): `{a:4, b:4, c:4, d:3, e:3}`. In the optimal tree, `d` and `e` are not siblings; the Huffman construction forces them to be, and so one of them ends up with a shorter code than one of the higher-frequency symbols. (In this case, it's close to optimal, but I'm sure there are counterexamples with a larger discrepancy. I didn't spend too much time looking.) It's an interesting problem and fairly well-studied. Afaics, the optimal algorithm for this particular case is O(n^2); the exponent depends on the larger code-length. – rici Feb 17 '23 at 18:07
  • @rici You're inferring an assignment algorithm. I explicitly did not specify such an algorithm. I merely noted that if every 0 is followed by a 1, then the requirements are met, and the available codes are shown in green. Ultimately, the objective function, which the code assignment needs to minimize, is `sum(P[i] * L[i])` where `P[i]` is the probability of symbol `i`, and `L[i]` is the length of the code assigned to symbol `i`. So to refute my answer, you need to show that there's a coding scheme that meets the requirements, and is better than encoding with `01` and `1` as the code units. – user3386109 Feb 17 '23 at 20:26
  • @rici On the other hand, if you agree that using code units `01` and `1` is optimal, and you know the optimal assignment algorithm, then it's up to you whether you want to share that with the world. – user3386109 Feb 17 '23 at 20:28
  • I wasn't trying to refute your answer, just to provide some information respecting the sufficiency of ordering probabilities at a single internal node. (Your "exercise".) I can't say that I know how to solve the assignment problem but I know that an O(n²) solution exists for the simplest unequal weight problem, with weights 1 and 2 (basically DP but using the SMAWK algorithm to reduce the search space). I haven't finished reading the papers yet, and it's not a simple algorithm. I might make a stab at it next week. I found the counterexample I mentioned while trying to understand it. – rici Feb 17 '23 at 20:49
  • @rici Ah yes, I see. Thanks for clarifying. I took a look at the wikipedia article for SMAWK. The article was not detailed enough for me to adapt to this problem. So I look forward to seeing your explanation. – user3386109 Feb 17 '23 at 21:13

The easiest way is to not mess with the Huffman code at all. Instead, post-process the output.

We will start with simple bit stuffing. When encoding, take your coded bit stream and whenever there is a 0, insert a 1 after it. On the decoding end, whenever you see a 0, remove the next bit (which will be the 1 that was inserted). Then do the normal Huffman decoding.
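A minimal sketch of both sides, one bit at a time (the callback interface and the names stuff_bit, unstuffer, and unstuff_bit are invented for illustration):

#include <stdio.h>

static void print_bit(int bit) { printf("%d", bit); }

// Encoding side: forward each coded bit, inserting a 1 after every 0.
void stuff_bit(int bit, void (*put)(int)) {
    put(bit);
    if (bit == 0)
        put(1);
}

// Decoding side: remember when the next bit is a stuffed 1 to be dropped.
typedef struct { int skip_next; } unstuffer;

// Returns the bit to hand to the Huffman decoder, or -1 if this bit is the
// inserted 1 and should be discarded.
int unstuff_bit(unstuffer *u, int bit) {
    if (u->skip_next) {
        u->skip_next = 0;
        return -1;
    }
    if (bit == 0)
        u->skip_next = 1;           // the bit that follows is the inserted 1
    return bit;
}

int main(void) {
    int coded[] = {1, 0, 0, 1, 0};  // some Huffman output
    for (int i = 0; i < 5; i++)
        stuff_bit(coded[i], print_bit);
    printf("\n");                   // prints 10101101
    return 0;
}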

This is not optimal, but the departure from optimality is bounded. You can reduce the impact of the bit stuffing by swapping the branches at every node, as needed, to put the lower probabilities or weights on the 0 sides.

This simple bit stuffing expands the input, on average, by a factor of 1.5, assuming 0s and 1s are equally likely in the coded stream.

So how close is this simple bit stuffing to optimal? It turns out that the number of possible n-bit patterns that have no occurrences of two 0 bits in a row is F(n+2), where F(n) is the nth Fibonacci number. With such a sequence we can code at most log2(F(n+2)) bits. The optimal expansion ratio of n bits is then n / log2(F(n+2)). In the limit of large n, that is 1 / log2(φ), where φ is the Golden Ratio. That optimal expansion ratio is 1.44042.

So the 1.5 from simple bit stuffing actually isn't too shabby. It's only about 4% above optimal.
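A few throwaway lines of C make the convergence of n / log2(F(n+2)) to that limit concrete:

#include <math.h>
#include <stdio.h>

int main(void) {
    double fnm1 = 1, fn = 1;        // F(1), F(2)
    for (int n = 1; n <= 64; n++) {
        double fnp1 = fn + fnm1;    // after this, fn holds F(n+2)
        fnm1 = fn;
        fn = fnp1;
        if (n % 16 == 0)
            printf("n = %2d  ratio = %.5f\n", n, n / log2(fn));
    }
    printf("limit  = %.5f\n", 1 / log2((1 + sqrt(5)) / 2));
    return 0;
}

The ratio climbs toward 1.44042 from below as n grows.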

But we can do better.

We can use the Fibonacci sequence, which represents the number of possible coded values for each bit added to the sequence without repeating 0s, to code the input bits. We show such an approach below, though for convenience, we avoid any repeating 1s instead of any repeating 0s. (Just invert the output to avoid repeating 0s.) Here is example C code that does that encoding and decoding for a fixed-size input and output:

typedef unsigned __int128 u128_t;

// Encode 87 bits to a 125-bit sequence that does not have two 1 bits in a row
// anywhere. Note that if such sequences are concatenated and the high bit of
// the 125 is a 1, then a 0 bit needs to be appended to make it 126 bits. This
// will occur 38.2% of the time (1 / goldenratio^2). The final expansion ratio
// of this encoding is then 125.382 / 87 = 1.44117. The theoretical optimum
// ratio is 1 / lg(goldenratio) = 1.44042. This encoding gets within 0.05% of
// optimal.
u128_t encode87to125(u128_t n) {
    n &= ((u128_t)1 << 87) - 1;
    u128_t e = 0;

    // Fibonacci numbers 126 and 125. (gcc and clang do not support 128-bit
    // literals, so these are assembled, which will occur at compile time.)
    u128_t fn = ((u128_t)0x4f88f2 << 64) | 0x877183a413128aa8u,
           fnm1 = ((u128_t)0x3127c1 << 64) | 0xed0f4dff88ba1575u;
    for (;;) {
        // Encode one bit.
        e <<= 1;
        if (n >= fn) {
            e |= 1;
            n -= fn;
        }

        if (fn == 1)
            // Done when the start of the Fibonacci sequence (1, 1) is reached.
            break;

        // Compute the Fibonacci number that precedes fnm1, and move fn and
        // fnm1 both down one in the sequence.
        u128_t fnm2 = fn - fnm1;
        fn = fnm1;
        fnm1 = fnm2;
    }
    return e;
}

// Decode a 125-bit value encoded by encode87to125() back to the original 87-bit
// value.
u128_t decode125to87(u128_t e) {
    // Start at the beginning of the Fibonacci sequence (1, 1).
    u128_t d = 0, fn = 1, fnm1 = 1;
    for (;;) {
        // Decode the low bit of e.
        if (e & 1)
            d += fn;
        e >>= 1;

        if (e == 0)
            // Done when there are no more 1 bits in e, since nothing more will
            // be added to d.
            break;

        // Advance fnm1 and fn up one spot in the Fibonacci sequence.
        u128_t fnp1 = fn + fnm1;
        fnm1 = fn;
        fn = fnp1;
    }
    return d;
}

The input is then encoded 87 bits at a time, and the output is 125 or 126 bits for each input block, the latter when the 125-bit result has a 1 in the top position, in which case a 0 must be stuffed.

The values 87 and 125 were picked since they are the most efficient pair that permits the output to fit in 128 bits. This gives an expansion ratio of 1.44117, within 0.05% of optimal. Many other choices are possible. Since the output is encoded a bit at a time and decoded a bit at a time, there is no need to accumulate it in an integer, so we could go to 112 bits encoded in 161 bits. Or we could limit ourselves to 64-bit arithmetic and convert 62 bits to 89 bits (within 0.09% of optimal).
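For the curious, those pairs can be reproduced with a quick search (not part of the codec itself): an m-bit output with no two 1 bits in a row has F(m+2) possible values, so up to floor(log2(F(m+2))) input bits fit, and the effective ratio charges for the stuffed top bit 38.2% of the time, as described above.

#include <stdio.h>

typedef unsigned __int128 u128_t;

static int bits(u128_t v) {         // position of the highest 1 bit, plus one
    int n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

int main(void) {
    u128_t fnm1 = 1, fn = 1;        // F(1), F(2)
    for (int m = 1; m <= 161; m++) {
        u128_t fnp1 = fn + fnm1;    // after this, fn holds F(m+2)
        fnm1 = fn;
        fn = fnp1;
        int k = bits(fn) - 1;       // max k with 2^k <= F(m+2)
        if (m == 89 || m == 125 || m == 161)
            printf("%3d -> %3d bits, effective ratio %.5f\n",
                   k, m, (m + 0.381966) / k);
    }
    return 0;
}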

At the end of the bit sequence, the remaining bits can be extended to 87 bits with high zeros, and the encoded result will then have high zeros that don't need to be sent. When decoding, fill out the last partial block to 125 bits with high 0s. If you don't know how many bits to expect when decoding, then append a single high 1 bit to the input before encoding, followed by high 0 bits. When decoding, scan back from the end through the 0 bits to the first 1 bit, and discard all of those.
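A sketch of that end-of-stream marker on the raw input side, with hypothetical helper names, assuming the final block holds fewer than 87 data bits (a completely full final block would push the marker into one more block):

typedef unsigned __int128 u128_t;   // same typedef as in the codec above

// Pad the final partial block: place a single 1 bit just above the nbits
// data bits; the high 0 bits above the marker are implicit.
u128_t pad_last_block(u128_t data, int nbits) {
    return data | ((u128_t)1 << nbits);
}

// Undo the padding: scan down from the top for the marker 1 bit, clear it,
// and return how many real data bits sit below it (-1 if no marker found).
int unpad_last_block(u128_t *data) {
    for (int n = 86; n >= 0; n--)
        if ((*data >> n) & 1) {
            *data ^= (u128_t)1 << n;    // remove the marker
            return n;
        }
    return -1;
}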

This post-processing approach to meet the constraints on the bit stream is arguably preferable to any attempt at modifying the Huffman code to be optimal for different-weight letters. Not only does it permit fast and well-tested Huffman algorithm implementations to be used as is, but also this will work with any entropy coding scheme, be it Huffman, arithmetic, or asymmetric numeral systems.

Mark Adler
  • Yes, you're correct. I have calculated how many bit sequences remain when eliminating all of those with a `00` in them anywhere. (Turns out to be a Fibonacci number.) The result is that, in the limit, you need 1 over the log base two of the Golden Ratio, equal to about 1.44, times _n_ bits to represent the information in an unconstrained sequence of _n_ bits. The expansion resulting from the simple bit stuffing that I suggested is 1.5. So the bit stuffing is in fact not optimal. – Mark Adler Feb 17 '23 at 00:06
  • There is a way to code this optimally, but the space for this comment is too small to describe it. :-) – Mark Adler Feb 17 '23 at 00:09
  • Agreed, this solution provides a good tradeoff between simplicity and optimality. – user3386109 Feb 17 '23 at 00:27
  • How does this differ from using `01` and `1` as the code units with the standard Huffman algorithm (which, if I understand correctly, is what @user3386109 is proposing)? – rici Feb 17 '23 at 18:10
  • @rici I removed my comments after this answer was updated. I suppose I should have left the counter-example that shows that padding with 1's is suboptimal. Consider a message of 8 unique symbols (e.g. `abcdefgh`), i.e. 8 symbols with equal probability. Standard Huffman assigns a 3-bit code to each symbol, and encodes the message with 24 bits. After padding with 1's, the message is 36 bits. A [better code assignment is shown here](https://i.stack.imgur.com/16oF0.png). With that code five symbols are encoded with 4 bits, and three symbols are encoded with 5 bits. Total message size is 35 bits. – user3386109 Feb 17 '23 at 20:09
  • @rici I don't think that user3386109 proposed that, but yes, this is the same as using Huffman's algorithm and emitting `01` and `1` for `0` and `1`. Huffman's algorithm is then treating the two letters as having equal cost, despite the fact that one of them costs twice as much as the other. – Mark Adler Feb 18 '23 at 02:38
  • @user3386109 I have added the very nearly optimal approach to the answer that I said was too small to fit in a comment. – Mark Adler Feb 21 '23 at 03:22
  • Also letting @rici know. – Mark Adler Feb 21 '23 at 03:23
  • Interesting. I must admit that I don't fully understand how that works. I'll have to study it some more later. Would give you another +1 if I could. – user3386109 Feb 21 '23 at 16:43