
I'm implementing Huffman encoding in C++. I can successfully build a Huffman tree and encode/decode strings.

The next thing I want to do is be able to encode/decode files, but I have a few problems. I'm using bool vectors to contain the code words. My problem is: I can only write bytes to a file. How do I write bit by bit? Is there perhaps a library I can use?

The other thing is that if I want to decode a file, I need the tree itself (or the code table). What's the best way to serialize the tree?

Any help would be much appreciated.

  • Two possible choices: Encode the bits into bytes. Or use one byte per bit. – Some programmer dude Nov 05 '16 at 14:48
  • This is your format specification, so do what you want. If you want to pack the bits tight then join the bit vectors writing them 8 bits at a time. If you want to write your codes in byte aligned blocks, do that. You can turn your tree into an array (look at tree traversals) or an edge list and write that however you like. There are too many options as you didn't really specify what you already have... – BeyelerStudios Nov 05 '16 at 14:51
  • As for your second problem, there are basically only three ways to [traverse a tree](https://en.wikipedia.org/wiki/Tree_traversal). Pick one. And instead of "display" the tree write it to disk. Do the reverse when reading the file. – Some programmer dude Nov 05 '16 at 14:51
  • And remember when you read your tree again that you might be on a different CPU architecture, so choose an endianness for the file/stream and, for example, use hton on the producer and ntoh on the consumer. – Surt Nov 05 '16 at 16:29
  • Possible duplicate of [Bitstream of variable-length Huffman codes - How to write to file?](http://stackoverflow.com/questions/28573597/bitstream-of-variable-length-huffman-codes-how-to-write-to-file) – SamB Nov 06 '16 at 02:23

1 Answer


Too bad the internal format of a C++ bool vector (std::vector<bool>) is not specified by the standard, since it is very likely to already be packed bits.

Anyway, you would use the <<, >>, and & operators to pack bits into bytes on the encoding side, as well as to unpack the bits on the decoding side. Given that a byte is made up of eight bits, this is trivial to do.
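As a rough illustration of the packing and unpacking described above, here is a minimal sketch. The function names pack_bits and unpack_bits are made up for this example, and the choices of most-significant-bit-first order and zero padding in the last byte are assumptions, not requirements. Because of the padding, the decoder has to learn the number of valid bits some other way, for example from a count stored in the file header or from an end-of-stream symbol.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pack bits into bytes, most significant bit first.
// Unused bits in the final byte are left as zero padding.
std::vector<std::uint8_t> pack_bits(const std::vector<bool>& bits) {
    std::vector<std::uint8_t> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) {
            // Set bit (7 - i % 8) of byte i / 8.
            bytes[i / 8] |= static_cast<std::uint8_t>(1u << (7 - i % 8));
        }
    }
    return bytes;
}

// Hypothetical helper: recover the first bit_count bits from packed bytes.
// bit_count must be known to the decoder, since padding is indistinguishable
// from real zero bits.
std::vector<bool> unpack_bits(const std::vector<std::uint8_t>& bytes,
                              std::size_t bit_count) {
    std::vector<bool> bits;
    bits.reserve(bit_count);
    for (std::size_t i = 0; i < bit_count && i / 8 < bytes.size(); ++i) {
        // Shift the wanted bit down to position 0 and mask it out.
        bits.push_back(((bytes[i / 8] >> (7 - i % 8)) & 1u) != 0);
    }
    return bits;
}
```

The resulting byte vector can then be written with an ordinary binary std::ofstream::write call. In practice you would probably append each code word to a running bit buffer and flush whole bytes as they fill up, rather than building one large bool vector first.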

As for transmitting a Huffman code, read about canonical Huffman codes. You do not need to send the code, just the code length in bits for each symbol. For more efficiency, the sequence of lengths can itself be compressed, with run-length and Huffman coding. See the Deflate format for an example.
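To show why the lengths alone are enough, here is a sketch of how both encoder and decoder can rebuild identical codes from them, following the assignment procedure described in the Deflate specification. The function name canonical_codes is hypothetical, and the sketch assumes the lengths come from a valid Huffman tree and are at most 32 bits; a length of 0 is taken to mean the symbol is unused.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: assign canonical Huffman codes from code lengths.
// lengths[s] is the code length in bits for symbol s (0 = symbol unused).
// The code for symbol s is returned in the low lengths[s] bits of codes[s].
std::vector<std::uint32_t> canonical_codes(const std::vector<unsigned>& lengths) {
    unsigned max_len = 0;
    for (unsigned len : lengths) {
        if (len > max_len) max_len = len;
    }

    // Count how many symbols use each code length.
    std::vector<std::uint32_t> count(max_len + 1, 0);
    for (unsigned len : lengths) {
        if (len != 0) ++count[len];
    }

    // Compute the smallest code value for each length.
    std::vector<std::uint32_t> next(max_len + 1, 0);
    std::uint32_t code = 0;
    for (unsigned len = 1; len <= max_len; ++len) {
        code = (code + count[len - 1]) << 1;
        next[len] = code;
    }

    // Hand out consecutive codes to symbols of equal length, in symbol order.
    std::vector<std::uint32_t> codes(lengths.size(), 0);
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        if (lengths[s] != 0) codes[s] = next[lengths[s]]++;
    }
    return codes;
}
```

With the lengths (3, 3, 3, 3, 3, 2, 4, 4) for symbols A to H, this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111, which is the example given in the Deflate specification. The file header then only needs one small length value per symbol instead of a serialized tree.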

Mark Adler