1

I'm building a Python program to compress/decompress a text file using a Huffman tree. Previously, I would store the frequency table a .json file alongside the compressed file. When I read in the compressed data and .json, I would rebuild the decompression tree from the frequency table. I thought this was a pretty eloquent solution.

However, I was running into an odd issue with files of medium length where they would decompress into strings of seemingly random characters. I found that the issue occurred when two character where occurring the same number of times. When I rebuilt my tree, any of those characters with matching frequencies would have the chance of getting swapped. For the majority of files, particularly large and small files, this wasn't a problem. Most letter occurred slightly more or slightly less than others. But for some medium sized files, a large portion of the characters occurred the same number of times as another character resulting in gibberish.

Is there a unique identifier for my nodes that I can use instead to easily rebuild my tree? Or should I be approaching the tree writing completely differently?

Jack Thias
  • 70
  • 1
  • 9

1 Answers1

2
  1. In the Huffman algorithm you need to pick the lowest two frequencies in a deterministic way that is the same on both sides. If there is a tie, you need to use the symbol to break the tie. Without that, you have no assurance that the sorting on both sides will pick the same symbols when faced with equal frequencies.

  2. You don't need to send the frequencies. All you need to send is the bit lengths for the symbols. The lengths can be coded much more compactly than the frequencies. You can build a canonical code from just the lengths, using the symbols to order the codes unambiguously.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158