
I'm working on a Huffman coding/decoding project in C and have a good understanding of how the algorithm should store information about the Huffman tree, re-build the tree during decoding, and decompress to the original input file using variable-length codes.

When writing to my compressed file, I will output a table of 256 4-byte integers containing the frequency of each possible byte value. I know I will also have to figure out a way to handle EOF, but I'll worry about that later.
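
Concretely, I'm planning something like this for the header (write_header, freq and outFile are just placeholder names):

#include <stdint.h>
#include <stdio.h>

// write the header: 256 symbol frequencies as 4-byte integers
static void write_header(FILE *outFile, const uint32_t freq[256])
{
    fwrite(freq, sizeof(uint32_t), 256, outFile);
}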

My question is: how should I use bit-wise operations to pack a stream of variable-length codes into whole bytes that I can write out one at a time with fwrite?

If I've created the following (fictitious) codes:

a: 001010101010011
b: 100
c: 11111
d: 0

The bitstream for "abcd" would be:

001010101010011100111110

I know I'll need to use some bit-wise operations to "chop" this stream up into writeable bytes:

00101010|10100111|00111110

A first attempt, with 8 different cases based on the lengths of the codes, did not work out well, and I'm stumped. Is there an easier way to handle variable-length codes when writing to a file?

Thank you

tiger2015
  • How are you representing the bitstreams internally? One bit per byte? A char array? That will affect which bit-ops you need. – kdopen Feb 18 '15 at 00:11
  • As of now, my bitstreams are stored in a 2-D char array codes[256][30], in which the longest code is 17 characters long. So if ASCII 'a' is encountered with code "0110", the writer will have to write 1 bit for each of the following chars: codes[97][0] = '0', codes[97][1] = '1', codes[97][2] = '1', codes[97][3] = '0'. – tiger2015 Feb 18 '15 at 00:13

2 Answers


Here's some pseudo-code to give you the general idea:

static unsigned char BitBuffer = 0; // bits accumulated so far
static int BitsInBuffer = 0;        // how many bits are in the buffer

static void WriteBitCharToOutput(char bitChar)
// buffer one binary digit ('1' or '0')
{
  if (BitsInBuffer > 7)        // a full byte is pending: write it out first
  {
    fputc(BitBuffer, stream);  // stream is your output FILE *
    BitsInBuffer = 0;
    BitBuffer = 0;             // just to be tidy
  }

  BitBuffer = (BitBuffer << 1) | (bitChar == '1' ? 1 : 0);
  BitsInBuffer++;
}

static void FlushBitBuffer(void)
// call after the last character has been encoded
// to flush out the remaining bits
{
  if (BitsInBuffer > 0)
  {
    do
    {
      WriteBitCharToOutput('0'); // pad with zeroes
    } while (BitsInBuffer != 1);
  }
}
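
For instance, assuming your codes[256][30] table holds NUL-terminated strings of '0' and '1' characters (and inFile is your input FILE *), the whole encoding pass is just a loop around that function:

int c;
while ((c = fgetc(inFile)) != EOF)  // for every byte of the input
{
    const char *p = codes[c];
    while (*p != '\0')
        WriteBitCharToOutput(*p++); // one call per bit of the code
}
FlushBitBuffer();                   // force out the final partial byte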
  • This is very helpful pseudo-code, thank you! I understand the necessity of flushing the final incomplete byte with zeroes, and resetting the buffer when it reaches a length of 8. Could you explain what bitops are happening in this (crucial) line of code? BitBuffer = (BitBuffer << 1) | (bitChar == '1' ? 1 : 0); – tiger2015 Feb 18 '15 at 00:37
  • The first part shifts the buffer one bit to the left, the second part encodes the current, incoming character into the buffer, as a bit - either 1 or 0. – 500 - Internal Server Error Feb 18 '15 at 00:39
  • If I'm calling such a function repeatedly from a for() loop stepping through the chars in a code, should the function return the bitbuffer and bitsinbuffer values? How can I keep track of my buffer and its length if I have to repeatedly call the function? – tiger2015 Feb 18 '15 at 00:41
  • The function that I outline above forwards the encoded data to a stream. You can also collect the encoded bytes in a buffer, if you prefer. You'll have to record the length of each Huffman code in a table so that you'll know how many bits to send to output for each character you encode. – 500 - Internal Server Error Feb 18 '15 at 00:46
  • Fortunately, I already have such a table! This code was extremely helpful, thank you for the warm welcome to the SO community. Last question, when flushing out the bits, shouldn't the do-while logic contain: while (BitsInBuffer != 0); ? – tiger2015 Feb 18 '15 at 00:50
  • It does - it sends a zero (might as well have been a one) to the output until the pending byte is written to the output stream. Which reminds me: make sure you have an EOF symbol defined in your Huffman table that you encode as the last one out so that you'll know, during decoding, when to stop. – 500 - Internal Server Error Feb 18 '15 at 01:03

As an alternative to the other answer, if you want to write several bits at once to your buffer, you can. It could look something like this (this is meant to be pseudocode, though it looks fairly real):

uint32_t buffer = 0; // bit accumulator; codes enter at the low end
int bufbits = 0;     // number of valid bits currently in the buffer
for (int i = 0; i < symbolCount; i++)
{
    int s = symbols[i];
    buffer <<= lengths[s];  // make room for the bits
    bufbits += lengths[s];  // buffer got longer
    buffer |= values[s];    // put in the bits corresponding to the symbol

    while (bufbits >= 8)    // as long as there is at least a byte in the buffer
    {
        bufbits -= 8;       // forget it's there
        writeByte((buffer >> bufbits) & 0xFF); // and save it
    }
}

Not shown: obviously you have to save anything left over in the buffer when you're done writing to it.
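
A minimal flush could look like this (same buffer, bufbits and writeByte as above; the leftover bits are left-aligned in the final byte and zero-padded):

if (bufbits > 0)
    writeByte((buffer << (8 - bufbits)) & 0xFF); // zero-pad the tail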

This assumes that the maximum code length is 25 bits or less: at most 7 bits can be left in the buffer after the inner loop, and 7 + 25 = 32 is the longest thing that fits in a 32-bit integer. This is not a bad limitation; usually the code length is limited to 15 or 16 anyway, to allow the simplest form of table-based decoding without needing a huge table.

harold