To cut a long story short, I'm trying to generate Huffman codes from a canonical Huffman list. The following two nested loops should run and generate a binary string for each code:

for (int i = 1; i <= 17; i++) {
    for (int j = 0; j < input.length; j++) {
        if (input[j] == i) {
            result.put(allocateCode(i, j), j); // update a hashmap
            huffCode += (1 << (17 - i)); // update the Huffman code
        }
    }
}

The code should take each length in turn, starting at 1, find every symbol with that length, and assign each one the next Huffman code. For example, codes of length 1 should go (in order) 0, 1, and codes of length 3 should go 100, 101, 110.
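For reference, here is a minimal self-contained sketch of that assignment, built around the same left-aligned accumulator as the loops above. The class name `CanonicalCodes`, the helper `toBits`, and the symbol-to-code map are hypothetical names chosen for illustration (the real `allocateCode` is not shown), and the `lengths` array is an assumption reconstructed from the first output listing in this question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CanonicalCodes {
    // Longest code length handled; matches the loop bound used in the question.
    static final int MAX_LENGTH = 17;

    /** Maps each symbol to its code as a bit string, shortest codes first. */
    static Map<Integer, String> assign(int[] lengths) {
        Map<Integer, String> codes = new LinkedHashMap<>();
        int huffCode = 0; // next free code, left-aligned in MAX_LENGTH bits
        for (int len = 1; len <= MAX_LENGTH; len++) {
            for (int symbol = 0; symbol < lengths.length; symbol++) {
                if (lengths[symbol] == len) {
                    // Shift the accumulator right to get a code of exactly len bits.
                    codes.put(symbol, toBits(huffCode >>> (MAX_LENGTH - len), len));
                    // Advance by one code of this length.
                    huffCode += 1 << (MAX_LENGTH - len);
                }
            }
        }
        return codes;
    }

    /** Hypothetical helper: the low len bits of value, zero-padded on the left. */
    static String toBits(int value, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int bit = len - 1; bit >= 0; bit--) {
            sb.append((value >>> bit) & 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: lengths[symbol], reconstructed from the first listing.
        int[] lengths = {4, 6, 1, 4, 4, 6, 2, 5};
        assign(lengths).forEach((symbol, code) ->
                System.out.println("symbol " + symbol + " -> " + code));
    }
}
```

This sketch does not validate its input, so a set of lengths that cannot form a prefix code will still produce output, just not usable codes.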

The allocateCode function simply returns a string that shows the result. The first run produces this:

Huffman code for code 2 is: 0 (0) length was: 1
Huffman code for code 6 is: 10 (2) length was: 2
Huffman code for code 0 is: 1100 (12) length was: 4
Huffman code for code 3 is: 1101 (13) length was: 4
Huffman code for code 4 is: 1110 (14) length was: 4
Huffman code for code 7 is: 11110 (30) length was: 5
Huffman code for code 1 is: 111110 (62) length was: 6
Huffman code for code 5 is: 111111 (63) length was: 6

This is correct, and the right Huffman codes have been generated. However, running it on a second array of lengths produces this:

Huffman code for code 1 is: 0 (0) length was: 1
Huffman code for code 4 is: 1 (1) length was: 1
Huffman code for code 8 is: 100 (4) length was: 3
Huffman code for code 9 is: 100 (4) length was: 3
Huffman code for code 13 is: 101 (5) length was: 3
Huffman code for code 16 is: 1011000 (88) length was: 7
Huffman code for code 10 is: 10110001 (177) length was: 8
Huffman code for code 2 is: 101100011 (355) length was: 9
Huffman code for code 3 is: 101100011 (355) length was: 9
Huffman code for code 0 is: 1011001000 (712) length was: 10
Huffman code for code 5 is: 1011001000 (712) length was: 10
Huffman code for code 6 is: 1011001001 (713) length was: 10
Huffman code for code 7 is: 10110010011 (1427) length was: 11
Huffman code for code 14 is: 10110010011 (1427) length was: 11
Huffman code for code 17 is: 10110010100 (1428) length was: 11
Huffman code for code 19 is: 10110010100 (1428) length was: 11
Huffman code for code 18 is: 101100101010000 (22864) length was: 15

As you can see, the same code is generated multiple times. Examples are codes 8 & 9, and codes 2 & 3.

I think my problem lies within the nested loops, but I can't figure out why it works perfectly for one run and fails on another.

I might just be missing something obvious, but I can't see it for looking.

Any advice would be greatly appreciated.

Thanks

UPDATE

After going back through my code, it seems that I was actually making a small mistake when reading in the data in the first place, hence I was getting incorrect Huffman codes!!

Tony
  • You are updating the codes as you go. It is normal for the huffman codes to change based on the values you have seen already. The other problem you have is there is no way to know where your values stop e.g. is `1011000` 4144111 or 581 – Peter Lawrey May 12 '13 at 10:46
  • I'm not quite sure what you mean Peter? I must confess I still get a bit confused by Huffman codes. Could you elaborate? – Tony May 12 '13 at 10:47
  • I assume you are generating your huffman codes in advance or your question doesn't make sense. Normally you update the huffman code as you process each symbol. This means a code which means one symbol can mean a different symbol later. – Peter Lawrey May 12 '13 at 10:49
  • The codes I'm generating are coming from a canonical list stored in an already compressed stream. I'm just trying to figure out the right code, and then store it. Each of the two examples shown above are from two different canonical lists. I think I'm getting more confused :) – Tony May 12 '13 at 10:56
  • I can say from your first two values you have a problem. You have given 0 one symbol and 1 another. This means no other symbol can use a 0 or 1. – Peter Lawrey May 12 '13 at 10:57
  • I did see that. I've got a feeling I'm reading in the wrong data in the first place. I'll have to go back and take a look at my input method. Thanks Peter. – Tony May 12 '13 at 10:58
  • BTW 00 and 01 are not the same as 0 and 1. I suspect this is what your first two symbols should be. – Peter Lawrey May 12 '13 at 11:01
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/29800/discussion-between-tony-and-peter-lawrey) – Tony May 12 '13 at 16:04
  • The problem with chat is that I am not on the computer all weekend (or at least I try not to be) – Peter Lawrey May 12 '13 at 19:27

1 Answer


The first two codes in your second example both have length one, which leaves no other possible codes after those first two. All prefix patterns have been used up.

Your code should keep a count of the remaining available codes so that it can detect an erroneous input. Start the count at two for codes of length one. Decrement the count for each code you assign, and double it every time you move up to the next length. (Make sure you double for every length, even one with no codes: if your codes jump from length 3 to length 5, the count must be doubled twice, once for length 4 even though no codes have that length.)

If that count ever goes negative, you have an error and you can stop right there. It is not possible to assign codes to that set of lengths.

If at the end of the process the count is not zero, then you have an incomplete code. That may or may not be an error depending on your application. It means that the code is not optimal, and fewer bits could have been used to code those symbols.
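The counting check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `CodeLengthCheck`, the `Status` enum, and the `maxLength` parameter are hypothetical, a length of 0 is assumed to mean "symbol not used", and `maxLength` is assumed small enough (well under 60) that the count fits in a `long`.

```java
final class CodeLengthCheck {
    enum Status { COMPLETE, INCOMPLETE, OVERSUBSCRIBED }

    /**
     * lengths[s] is the code length of symbol s.
     * A length of 0 is assumed to mean the symbol is unused.
     */
    static Status check(int[] lengths, int maxLength) {
        // Tally how many symbols use each code length.
        int[] countPerLength = new int[maxLength + 1];
        for (int len : lengths) {
            if (len < 0 || len > maxLength) {
                throw new IllegalArgumentException("bad code length: " + len);
            }
            countPerLength[len]++;
        }

        long available = 2; // two possible codes of length one
        for (int len = 1; len <= maxLength; len++) {
            available -= countPerLength[len]; // one slot used per code
            if (available < 0) {
                return Status.OVERSUBSCRIBED; // no prefix code can exist
            }
            // Moving up one length doubles the free slots; this runs for
            // every length, including lengths that have no codes.
            available *= 2;
        }
        return available == 0 ? Status.COMPLETE : Status.INCOMPLETE;
    }

    public static void main(String[] args) {
        // Two codes of length one followed by a longer code, as in the
        // second listing of the question.
        System.out.println(check(new int[]{1, 1, 3}, 17)); // OVERSUBSCRIBED
        // Lengths reconstructed from the first listing of the question.
        System.out.println(check(new int[]{4, 6, 1, 4, 4, 6, 2, 5}, 17)); // COMPLETE
    }
}
```

Running a check like this before assigning any codes would have flagged the second set of lengths as soon as the length-3 codes were counted.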

Mark Adler