
I have a Huffman coding algorithm that compresses characters into bit sequences of arbitrary length, shorter than the default size of a char (8 bits on most modern platforms).

If the Huffman Code compresses an 8-bit character into 3 bits, how do I represent that 3-bit value in memory? To take this further, how do I combine multiple compressed characters into a compressed representation?

For example, suppose the code for l is "00000". Stored as a string, that takes 5x8 bits, since each 0 is itself a character. How do I represent l with the 5 bits 00000 instead of a character sequence?

A C or C++ implementation is preferred.

Mark Adler
AmanSharma
  • Please read about [how to ask good questions](http://stackoverflow.com/help/how-to-ask), as well as [this question checklist](https://codeblog.jonskeet.uk/2012/11/24/stack-overflow-question-checklist/). And of course please learn how to create a [mcve]. Lastly, please pick *one* language. C and C++ are two *very* different languages. – Some programmer dude Nov 28 '18 at 09:23
  • Which 3 bit CPU did you have in mind to use for this? – Lundin Nov 28 '18 at 09:29
  • Use a bit-field, if you wish! – Sourav Ghosh Nov 28 '18 at 09:31
  • It seems you need to study the Huffman coding algorithm in some book or good article, then try to implement it. Note that the length of the codes is variable, so you cannot guarantee a 3-bit length (excluding the case of 5 possible chars) – MBo Nov 28 '18 at 09:38
  • This sounds a lot like a homework assignment. A good way to do this is to separate your code into two pieces: the first returns lists of 1s and 0s, which the second then compresses into actual bits. Use the left-shift operator `<<` and OR assignment operator `|=` to push bits onto a variable. [C Bitwise Operations](https://en.wikipedia.org/wiki/Bitwise_operations_in_C) – Billy Brown Nov 28 '18 at 09:45
  • @BillyBrown Actually, I created a minor project and built Huffman codes, and I want to build a real model of it, like a compressor, but how to reduce the size of the file is the main hurdle. Can you help me do that? – AmanSharma Nov 28 '18 at 11:15
  • @SouravGhosh Thanks mate, but how do I specify it at runtime? – AmanSharma Nov 28 '18 at 11:19
  • @AmanSharma Ok, I was just checking, as I had it as an assignment at university. I have an answer that, when given an array of 1 and 0 ints, compresses them into bits, but I cannot yet post it. – Billy Brown Nov 28 '18 at 12:01
  • You manufacture a stream of bits out of some integer type (`unsigned char`, `unsigned int`, or other type you choose). As you produce new bits, you append them to your stream, using bit operations to position them within the larger integer type. At times this will require splitting them across multiple units. For example, the first three bits would go into bits 7 to 5 of an `unsigned char`, then the next four in bits 4 to 1, then the next three in bit 0 of the first `unsigned char` and bits 7 to 6 of a new `unsigned char`. You will have to write code to implement this. – Eric Postpischil Nov 28 '18 at 12:23
  • Sorry about all the downvotes. This is a fine question. To make a variable that holds a variable number of bits, we just use the lower bits of one `unsigned int` to store the bits, and use another `unsigned int` to remember how many bits we have stored. When writing out a Huffman-compressed file, we wait until we have at least 8 bits stored. Then we write out a `char` using the top 8 bits and subtract 8 from the stored bit count. – Matt Timmermans Nov 28 '18 at 13:37
  • @EricPostpischil Thanks for this technique, but if you can provide a link to some code using it, that would be a great help. – AmanSharma Nov 28 '18 at 14:05
  • @matttimmermans Thank you, sir, for the support. I actually couldn't understand why people downvoted. Also, can you suggest some code using that technique? I would be indebted to you for your effort. – AmanSharma Nov 28 '18 at 14:06
  • @AmanSharma [here is a gist](https://gist.github.com/Druid-of-Luhn/6229cdbbc4bd2bc879a68e615a40ed60) that will compress an array of bits into `char`s, and pad the right-hand-side with 0s if the data doesn't fit cleanly. – Billy Brown Nov 28 '18 at 14:37

2 Answers

Now that this question is re-opened...

To make a variable that holds a variable number of bits, we just use the lower bits of one unsigned int to store the bits, and use another unsigned int to remember how many bits we have stored.

When writing out a Huffman-compressed file, we wait until we have at least 8 bits stored. Then we write out a char using the top 8 bits and subtract 8 from the stored bit count.

Finally, at the end, if you have any bits left to write out, you pad them with zeros up to a whole multiple of 8 bits and write out the final char.

In C++, it's useful to encapsulate the output in some kind of BitOutputStream class, like:

#include <fstream>

class BitOutputStream
{
    std::ofstream m_out;
    unsigned m_bitsPending; // bits not yet written, held in the low-order end
    unsigned m_numPending;  // how many of those bits are valid

public:
    // Opening the file in binary mode is one reasonable way to do this part
    explicit BitOutputStream(const char *fileName)
        : m_out(fileName, std::ios::binary),
          m_bitsPending(0),
          m_numPending(0)
    {
    }

    // write out the lower <count> bits of <bits> (assumes a 32-bit unsigned)
    void write(unsigned bits, unsigned count)
    {
        if (count > 16)
        {
            // do it in two steps to prevent overflow
            write(bits >> 16, count - 16);
            count = 16;
        }
        // make space for new bits
        m_numPending += count;
        m_bitsPending <<= count;

        // store new bits
        m_bitsPending |= (bits & ((1u << count) - 1u));

        // write out any complete bytes
        while (m_numPending >= 8)
        {
            m_numPending -= 8;
            m_out.put((char)(m_bitsPending >> m_numPending));
        }
    }

    // write out any remaining bits, padded with zeros on the right
    void flush()
    {
        if (m_numPending > 0)
        {
            m_out.put((char)(m_bitsPending << (8 - m_numPending)));
        }
        m_bitsPending = m_numPending = 0;
        m_out.flush();
    }
};
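
Since the question also asks about C, here is a minimal sketch of the same accumulator idea in plain C, writing into a caller-supplied memory buffer rather than a stream. The `BitWriter` struct and the `bw_init`, `bw_write`, and `bw_flush` names are hypothetical, chosen only for illustration, and the sketch assumes each code is at most 16 bits long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory bit writer; all names are illustrative. */
typedef struct {
    unsigned char *out;   /* destination buffer, supplied by the caller */
    size_t capacity;      /* size of out in bytes */
    size_t length;        /* bytes written so far */
    uint32_t pending;     /* bits not yet written, in the low-order end */
    unsigned num_pending; /* number of valid bits in pending */
} BitWriter;

void bw_init(BitWriter *w, unsigned char *out, size_t capacity)
{
    w->out = out;
    w->capacity = capacity;
    w->length = 0;
    w->pending = 0;
    w->num_pending = 0;
}

/* Append the lower `count` bits of `bits`, most significant bit first. */
void bw_write(BitWriter *w, uint32_t bits, unsigned count)
{
    assert(count <= 16); /* at most 7 + 16 = 23 live bits, fits in 32 */
    w->pending = (w->pending << count) | (bits & ((1u << count) - 1u));
    w->num_pending += count;
    /* Emit every complete byte, taking the oldest 8 pending bits each time. */
    while (w->num_pending >= 8) {
        assert(w->length < w->capacity);
        w->num_pending -= 8;
        w->out[w->length++] = (unsigned char)(w->pending >> w->num_pending);
    }
}

/* Write any leftover bits, padded with zeros on the right. */
void bw_flush(BitWriter *w)
{
    if (w->num_pending > 0) {
        assert(w->length < w->capacity);
        w->out[w->length++] = (unsigned char)(w->pending << (8 - w->num_pending));
        w->num_pending = 0;
    }
    w->pending = 0;
}
```

As a worked example, writing the codes 101 (3 bits), 0110 (4 bits), and 111 (3 bits) and then flushing produces the bit string 10101101 11000000, that is, the two bytes 0xAD and 0xC0. The last six bits of the second byte are padding.
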
Matt Timmermans

If your Huffman coder returns an array of 1s and 0s representing the bits that should and should not be set in the output, you can shift these bits onto an unsigned char. After every eight bits, you start writing to the next character, ultimately outputting an array of unsigned char. The number of compressed characters that you will output is equal to the number of bits divided by eight, rounded up to the next whole number.

In C, this is a relatively simple function, consisting of a left shift (<<) and a bitwise OR (|). Here is the function, with an example to make it runnable. To see it with more extensive comments, please refer to this GitHub gist.

#include <stdlib.h>
#include <stdio.h>

#define BYTE_SIZE 8

size_t compress_code(const int *code, const size_t code_length, unsigned char **compressed)
{
    if (code == NULL || code_length == 0 || compressed == NULL) {
        return 0;
    }
    size_t compressed_length = (code_length + BYTE_SIZE - 1) / BYTE_SIZE;
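    *compressed = calloc(compressed_length, sizeof(unsigned char));
    if (*compressed == NULL) {
        // Allocation failed: report that nothing was written
        return 0;
    }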
    for (size_t char_counter = 0, i = 0; char_counter < compressed_length && i < code_length; ++i) {
        if (i > 0 && (i % BYTE_SIZE) == 0) {
            ++char_counter;
        }
        // Shift the last bit to be set left by one
        (*compressed)[char_counter] <<= 1;
        // Put the next bit onto the end of the unsigned char
        (*compressed)[char_counter] |= (code[i] & 1);
    }
    // Pad the remaining space with 0s on the right-hand-side
    (*compressed)[compressed_length - 1] <<= compressed_length * BYTE_SIZE - code_length;
    return compressed_length;
}

int main(void)
{
    const int code[] = { 0, 1, 0, 0, 0, 0, 0, 1,   // 65: A
                         0, 1, 0, 0, 0, 0, 1, 0 }; // 66: B
    const size_t code_length = 16;
    unsigned char *compressed = NULL;
    size_t compressed_length = compress_code(code, code_length, &compressed);
    for (size_t i = 0; i < compressed_length; ++i) {
        printf("%c\n", compressed[i]);
    }
    free(compressed); // the caller owns the buffer allocated by compress_code
    return 0;
}

You can then just write the characters in the array to a file, or even copy the array's memory directly to a file, to write the compressed output.
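
Writing the packed array to a file might look like the sketch below. The helper name `write_compressed` is hypothetical. It opens the file in binary mode so that no byte is translated on the way out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write `length` packed bytes to `path`.
   Returns 0 on success and -1 on failure. */
int write_compressed(const char *path, const unsigned char *data, size_t length)
{
    assert(path != NULL && data != NULL);
    FILE *fp = fopen(path, "wb"); /* "b" avoids newline translation */
    if (fp == NULL) {
        return -1;
    }
    size_t written = fwrite(data, 1, length, fp);
    int close_result = fclose(fp);
    return (written == length && close_result == 0) ? 0 : -1;
}
```

Note that the zero padding in the last byte is indistinguishable from real 0 bits, so a decoder will probably also need the original bit count (or an end-of-stream code) stored alongside the data.
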

Reading the compressed characters into bits, which will allow you to traverse your Huffman tree for decoding, is done with right shifts (>>) and checking the rightmost bit with bitwise AND (&).
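
A minimal sketch of that read side, assuming the bytes were packed most significant bit first as in the function above. The names `read_bit` and `unpack_bits` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return bit number `index` (0 = first bit written)
   from a buffer packed most-significant-bit first. */
int read_bit(const unsigned char *data, size_t index)
{
    assert(data != NULL);
    unsigned shift = 7u - (unsigned)(index % 8); /* bit 0 lives in the top position */
    return (data[index / 8] >> shift) & 1;
}

/* Unpack the first `bit_count` bits into an array of 0/1 ints. */
void unpack_bits(const unsigned char *data, size_t bit_count, int *bits)
{
    for (size_t i = 0; i < bit_count; ++i) {
        bits[i] = read_bit(data, i);
    }
}
```

A decoder would typically call `read_bit` once per step while walking the Huffman tree, going to one child on 0 and the other on 1, and emitting a character whenever it reaches a leaf.
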

Billy Brown
  • Mixing floating-point into code just to calculate an integer ceiling is not good. It incurs unnecessary costs of conversion to floating-point, floating-point division, and conversion back. Unless there is a danger of overflow, one can simply calculate `(code_length + BYTE_SIZE - 1) / BYTE_SIZE`, which should compile to an integer add and a shift. – Eric Postpischil Nov 28 '18 at 17:13