
I'm trying to implement a Huffman compressor in C++.

In brief I have 5 classes:

  1. HuffmanTree - represents a tree structure

  2. TreeNode - represents a single node of the tree

  3. HuffmanArchiver - compress/decompress etc.

  4. BitStringWrite - writing bits.

  5. BitStringRead - reading bits.

(The full implementation is here: headers and cpp's)

I can build a code table and encode a binary file, but I have some questions about reading and writing bits and about the decoding phase.

During the encoding phase, I first save the Huffman tree to the output file like this:

void Archiver::encodeTree(BitStringWrite& bw, TreeNode* node){
    if (node -> isLeaf()) {
        bw.writeBit(1);
        char symb = node->getChar();
        bw.getStream().write(&symb, sizeof(symb));
    }
    else {
        bw.writeBit(0);
        encodeTree(bw, node->getLeftTree());
        encodeTree(bw, node->getRightTree());
    }
}

bw here is an instance of the class BitStringWrite, which is implemented like this:

BitStringWrite::BitStringWrite(std::ostream &_out_f) : _byte(0), _pos(0), _out_f(_out_f) {}

void BitStringWrite::writeBit(bool bit) {
   if (_pos == 8)
       flush();
   if (bit == 1) {
       _byte |= (1 <<   (7 - _pos));
   }
   _pos++;
}

void BitStringWrite::writeByte(char b){
   for(int i = 0; i < 8; i++)
       this -> writeBit((b >> i) & 1); //?????
}

void BitStringWrite::flush() {
   if (_pos != 0) {
       _out_f.write(&_byte, sizeof(char));
       _pos = 0;
       _byte = 0;
   }
}

std::ostream& BitStringWrite::getStream(){
    return _out_f;
}

I'm not sure about my writeByte implementation, but the main question here is: why would I want to implement a writeByte function if I already have ostream::write?
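One possible reason to route bytes through the bit buffer: encodeTree above calls writeBit and then writes the symbol straight to the underlying stream, so the symbol byte reaches the file while the marker bit is still pending in _byte. The xxd dump further down is consistent with that: the bytes c, b, \n, a come first and the structure bits 0010111 only show up in the fifth byte. The following is a minimal sketch, not the original class. BitWriter, leafViaBuffer and leafViaStream are hypothetical names, and writeByte emits the most significant bit first, which is an assumption (the posted version emits the least significant bit first; either order should work as long as the reader uses the same one).

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical minimal bit writer, modeled on BitStringWrite from the question.
class BitWriter {
public:
    explicit BitWriter(std::ostream& out) : out_(out) {}

    // Append one bit; bits fill each byte from the most significant end.
    void writeBit(bool bit) {
        assert(pos_ < 8);
        if (bit) byte_ = static_cast<unsigned char>(byte_ | (1u << (7 - pos_)));
        if (++pos_ == 8) flush();
    }

    // Append 8 bits through the same buffer (MSB first), so they land
    // directly after whatever bits are already pending.
    void writeByte(unsigned char b) {
        for (int i = 7; i >= 0; --i) writeBit(((b >> i) & 1u) != 0);
    }

    // Emit the current byte if it holds any bits; unused bits stay zero.
    void flush() {
        if (pos_ == 0) return;
        out_.put(static_cast<char>(byte_));
        byte_ = 0;
        pos_ = 0;
    }

private:
    std::ostream& out_;
    unsigned char byte_ = 0;
    int pos_ = 0;
};

// Leaf marker bit 1 followed by 'c', both written through the bit buffer.
std::string leafViaBuffer() {
    std::ostringstream out;
    BitWriter bw(out);
    bw.writeBit(true);
    bw.writeByte('c');
    bw.flush();
    return out.str();
}

// Same data, but the symbol bypasses the buffer as in the posted encodeTree.
std::string leafViaStream() {
    std::ostringstream out;
    BitWriter bw(out);
    bw.writeBit(true);
    out.put('c');   // reaches the stream before the pending marker bit
    bw.flush();
    return out.str();
}
```

With this sketch, leafViaBuffer() yields the bytes 0xB1 0x80 (the bit 1, then 01100011, then zero padding), while leafViaStream() yields 0x63 0x80, with the symbol ahead of its own marker bit.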

For example

> cat test.in
aaaabc

The buildTable function produces: a = 1, b = 010, c = 00, and a fourth, unprintable symbol = 011

(it seems like this last symbol is just the trailing \n).

xxd -b test.out
00000000: 01100011 01100010 00001010 01100001 00101111 11101000  cb.a/.
00000006: 01101100

Note that the encoded message starts at the last bit of the fifth byte. The first five (almost) bytes represent the tree structure.

Ok, it seems like the encoding phase is working. Let's now proceed to the decoding phase.

The main function for decoding phase is decompress. It invokes the decodeTree function to decode the Huffman tree, then generates a code table based on this tree and then decodes the text.

The function decodeTree doesn't work properly:

TreeNode* Archiver::decodeTree(BitStringRead& br, TreeNode* cur){
    if (br.readBit()) {
        return new TreeNode(br.readByte(), 0, false, NULL, NULL);
    }
    else {
        TreeNode* left = decodeTree(br, cur-> getLeftTree());
        TreeNode* right = decodeTree(br, cur-> getRightTree());
        decodeTree(br, cur-> getRightTree());
        return new TreeNode(0, 0, false, left, right);
    }
}
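For comparison, here is a minimal sketch of a decoder that mirrors the pre-order encoding exactly. Two things differ from the version above: each subtree is decoded exactly once (the posted code calls decodeTree a third time and discards the result, which consumes the bits of another subtree), and no cur argument is needed because the nodes are created from the bits alone. Node, Bits, leaf, inner, leaves and exampleTree are hypothetical stand-ins for illustration, and a std::vector<bool> replaces the file so that only the recursion is shown. Symbols are stored as 8 bits, most significant bit first, which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for TreeNode: a node without children is a leaf.
struct Node {
    char symbol = 0;
    std::unique_ptr<Node> left, right;
    bool isLeaf() const { return !left && !right; }
};

using Bits = std::vector<bool>;

// Pre-order: 1 + 8 symbol bits for a leaf, 0 + left + right otherwise.
void encodeTree(const Node& n, Bits& out) {
    if (n.isLeaf()) {
        out.push_back(true);
        for (int i = 7; i >= 0; --i)
            out.push_back(((static_cast<unsigned char>(n.symbol) >> i) & 1u) != 0);
    } else {
        out.push_back(false);
        encodeTree(*n.left, out);
        encodeTree(*n.right, out);
    }
}

// Mirror of encodeTree: consumes exactly the bits that encodeTree produced.
std::unique_ptr<Node> decodeTree(const Bits& in, std::size_t& pos) {
    assert(pos < in.size());
    std::unique_ptr<Node> node(new Node());
    if (in[pos++]) {
        unsigned char s = 0;
        for (int i = 0; i < 8; ++i) {
            assert(pos < in.size());
            s = static_cast<unsigned char>((s << 1) | (in[pos++] ? 1u : 0u));
        }
        node->symbol = static_cast<char>(s);
    } else {
        node->left = decodeTree(in, pos);   // left subtree, once
        node->right = decodeTree(in, pos);  // right subtree, once
    }
    return node;
}

std::unique_ptr<Node> leaf(char c) {
    std::unique_ptr<Node> n(new Node());
    n->symbol = c;
    return n;
}

std::unique_ptr<Node> inner(std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    std::unique_ptr<Node> n(new Node());
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Concatenates the leaf symbols from left to right.
std::string leaves(const Node& n) {
    if (n.isLeaf()) return std::string(1, n.symbol);
    return leaves(*n.left) + leaves(*n.right);
}

// Tree for the code table a = 1, b = 010, c = 00, '\n' = 011 (0 = left, 1 = right).
std::unique_ptr<Node> exampleTree() {
    return inner(inner(leaf('c'), inner(leaf('b'), leaf('\n'))), leaf('a'));
}
```

Encoding exampleTree() gives 39 bits (7 structure bits plus 4 symbols of 8 bits), and decoding them returns a tree whose leaves read c, b, \n, a from left to right, with the read position ending exactly at the last bit.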

I think the main reason is that it can't read the tree structure properly using br, an instance of the class BitStringRead.

Look how it's implemented inside:

BitStringRead::BitStringRead(std::istream &_in_f) : _pos(8), _in_f(_in_f) {}

bool BitStringRead::readBit() {
    if (_pos == 8) {
        _in_f.read(&_byte, sizeof(char));
        _pos = 0;
    }
    return (_byte >> _pos++) & (char)1;
}

char BitStringRead::readByte() { 
    char sym = (char)0;
    for (int i = 0; i < 8; i++){     
       sym |= ((1 & readBit()) << (i));
    }
    return sym;
 }

Assume we are at the beginning of a file and the first byte is 0001 0110. When I invoke readBit for the first time, it reads the whole byte (8 bits) from the stream. The next 3 calls don't read anything from the stream; they just return the following bits. The 1 in the fourth position denotes a leaf node, and I know that a leaf marker is followed by a symbol, so I read it.

I suspect the symbol is then read starting from the ninth bit rather than right after the fourth, because of how readBit is implemented.
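To check that assumption, here is a minimal sketch of a reader whose readByte goes through readBit and therefore continues in the middle of a byte. One detail is worth comparing against the posted classes: writeBit fills a byte with 1 << (7 - _pos), most significant bit first, while the posted readBit extracts _byte >> _pos, least significant bit first. The sketch reads MSB first to match the writer. BitReader and byteAfterFourBits are hypothetical names, and running out of input is treated as a programming error, which is a simplification.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical minimal bit reader, modeled on BitStringRead from the question.
class BitReader {
public:
    explicit BitReader(std::istream& in) : in_(in) {}

    // Return the next bit, most significant bit of each byte first.
    bool readBit() {
        if (pos_ == 8) {
            const int c = in_.get();   // fetch a new byte only when needed
            assert(c != std::char_traits<char>::eof());
            byte_ = static_cast<unsigned char>(c);
            pos_ = 0;
        }
        const bool bit = ((byte_ >> (7 - pos_)) & 1u) != 0;
        ++pos_;
        return bit;
    }

    // Assemble 8 bits via readBit, so the byte may straddle two input bytes.
    unsigned char readByte() {
        unsigned char b = 0;
        for (int i = 0; i < 8; ++i)
            b = static_cast<unsigned char>((b << 1) | (readBit() ? 1u : 0u));
        return b;
    }

private:
    std::istream& in_;
    unsigned char byte_ = 0;
    int pos_ = 8;   // forces a fetch on the first readBit call
};

// Input bytes 0001 0110 and 0011 0000: read four single bits, then one byte.
unsigned char byteAfterFourBits() {
    std::istringstream in(std::string("\x16\x30", 2));
    BitReader br(in);
    for (int i = 0; i < 4; ++i) br.readBit();   // 0, 0, 0, 1
    return br.readByte();                       // 0110 0011
}
```

In this sketch byteAfterFourBits() returns 0x63 ('c'): the low four bits of the first byte followed by the high four bits of the second, so reading continues right after the fourth bit rather than at the ninth.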

False Promise
  • Are you going for an educational simplified approach here (that's what it looks like) or a serious implementation? – harold Apr 27 '17 at 15:34
  • @harold First case. – False Promise Apr 27 '17 at 15:36
  • @NegligibleSenescence You really shouldn't start your identifiers with underscores. -- *I think the main reason is because...* -- You're not using your debugger to see what the reason is? – PaulMcKenzie Apr 27 '17 at 15:43
  • @PaulMcKenzie Debugger says the root node has no children and its `is_leaf` variable is equal to `true`... – False Promise Apr 27 '17 at 15:46
  • Anyway, you can completely avoid the need to store the tree by using canonical Huffman codes. In some sense there is less to go wrong then. You can also avoid doing single-bit operations in general (unless there is a code word with length 1 of course) by buffering more bits, enough that 7+length(longest code) is not more than the size of your buffer (this requires a length limit on the codes, which is almost ubiquitous in real implementations). Interacting directly with the underlying stream ignores the bitbuffer so you generally can't do it (you can, if you're careful enough) – harold Apr 27 '17 at 15:48
  • @PaulMcKenzie I noticed that my default constructor for class `HuffmanTree` sets `is_leaf` to `true` and I changed it to `false`. The most confusing part is when I create a tree in main, i.e `HuffmaTree nt;` and trying to print `is_leaf` method of the root node, it prints `0`, but when I pass it to `decompress` and trying to print it again it says it is equal `1`... – False Promise Apr 27 '17 at 17:00
  • @NegligibleSenescence Did you override your class's copy constructor and/or assignment operator? If so, did you copy **all** the members from the source object to destination object? That is one way to mess things up and to have that type of behavior, and that is writing a user-defined copy ctor / assignment op, and failing to copy everything. – PaulMcKenzie Apr 27 '17 at 17:03
