Packing chars into 5 bits and writing results to file (C++)

Question

I have a vector containing chars. These chars can only be the 26 upper-case letters of the alphabet, hence the number of bits representing these characters can be reduced from 8 to 5. I then need to write the results into a file, to be used later.

My current thinking is that the 3 most significant bits are all the same for A..Z, hence I could use the 5 least significant bits to uniquely identify the characters? However I am struggling to write this unformatted data to a file.

How would I go about doing this and writing the result to a file?

At the moment your question is a bit too broad for the stackoverflow format. If you could add the code that you've tried so far with some details of what is going wrong I'm pretty sure someone will be able to help you. — PeterSW, Apr 30 '14 at 17:53
You don't shift to get the low order bits, you mask: `ch & 0x1F`. Once you've got that, it gets more difficult, since you'll have to shift, mask and or them into the final results. — James Kanze, Apr 30 '14 at 18:16
@JamesKanze - I edited my question as I did not mean to say shift. Thanks for pointing this out — Craig, Apr 30 '14 at 18:22
`A-Z` (which are `65-90`) **DO NOT** have the same significant 3 bits. `A` is `10000001` whereas `Z` is `01011010`. `100` is not the same as `010`, so masking by `0x1F` (`00011111`) to remove the high bits will lose data. What you can do, however, is subtract 65 from `A-Z` to make `0-25`, which will still produce 5bit values (25 is `11001`). Write those reduced values as needed. When reading the values back, simply add 65 to convert `0-25` back to `A-Z`. — Remy Lebeau, Apr 30 '14 at 18:37
@Craig: You are right. My calculator was not displaying the leading 0, so it threw me off. But what I said about subtracting/adding 65 still stands. That is the safer way to go. The end result is like what you are thinking. — Remy Lebeau, Apr 30 '14 at 19:55
@RemyLebeau One or the other. Neither will work with EBCDIC, but that's likely not to be a problem. (I mentioned `&` because the original question was expressed in terms of bits and bitwise operators. In practice, I'd probably subtract `'A'`. And not worry about the encoding details.) — James Kanze, May 01 '14 at 08:17

score 1 · Answer 1 · answered May 01 '14 at 08:37

To reduce the character to 5 bits, you can use either `ch & 0x1F` (which gives 1 to 26) or `ch - 'A'` (which gives 0 to 25); neither will work with EBCDIC, but that's likely not an issue. (If it is: a table lookup in a string of all of the capital letters, returning the index, can be used.)
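
As a rough sketch of the three mappings just mentioned (the function names are made up for illustration and do not come from the answer), assuming the input really contains only upper-case letters:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper names, chosen only for this example.

// Maps 'A'..'Z' to 0..25. Assumes an encoding in which the capital
// letters are contiguous (true for ASCII, not for EBCDIC).
inline unsigned encodeBySubtract(char ch)
{
    assert(ch >= 'A' && ch <= 'Z');
    return static_cast<unsigned>(ch - 'A');
}

// Maps 'A'..'Z' to 1..26 by keeping the low five bits (ASCII only).
inline unsigned encodeByMask(char ch)
{
    assert(ch >= 'A' && ch <= 'Z');
    return static_cast<unsigned>(ch) & 0x1Fu;
}

// Encoding-independent variant: the code is the index in a lookup string.
inline unsigned encodeByLookup(char ch)
{
    static const std::string letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const std::string::size_type pos = letters.find(ch);
    assert(pos != std::string::npos);
    return static_cast<unsigned>(pos);
}

// Inverse of encodeBySubtract and encodeByLookup (codes 0..25).
inline char decode(unsigned code)
{
    static const std::string letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    assert(code < letters.size());
    return letters[code];
}
```

Whichever mapping is used for writing, the reader has to undo the same one: the masked codes start at 1, the other two start at 0.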

After that, it gets complicated. The simplest solution is to define a bit array, something like:

#include <vector>

class BitArray
{
    std::vector<unsigned char> myData;
    static int byteIndex( int index ) { return index / 8; }
    static int bitIndex( int index ) { return index % 8; }
    static unsigned char bitMask( int index )
    {
        return static_cast<unsigned char>( 1 << bitIndex( index ) );
    }
    // Number of bytes needed to hold bitCount bits, rounded up.
    static int byteCount( int bitCount )
    {
        return byteIndex( bitCount )
            + (bitIndex( bitCount ) != 0 ? 1 : 0);
    }
public:
    // All bits start out cleared.
    explicit BitArray( int size ) : myData( byteCount( size ) ) {}
    void set( int index )
    {
        myData[byteIndex( index )] |= bitMask( index );
    }
    void reset( int index )
    {
        myData[byteIndex( index )] &= ~bitMask( index );
    }
    bool test( int index ) const
    {
        return (myData[byteIndex( index )] & bitMask( index )) != 0;
    }
};

(You'll need more to extract the data, but I'm not sure in what format you need it.)

You then loop over your string:

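int const count = static_cast<int>( s.size() );
BitArray results( 5 * count );
for ( int index = 0; index != count; ++ index ) {
    int const code = s[index] - 'A';    // 0..25
    for ( int pos = 0; pos != 5; ++ pos ) {
        // Copy bit pos of the code; the bits start out cleared.
        if ( (code & (1 << pos)) != 0 ) {
            results.set( 5 * index + pos );
        }
    }
}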

This will work without problems. When I tried using it (or rather the equivalent) in the distant past (for Huffman encoding, in C, since this was in the 1980s), it was also way too slow. If your strings are fairly short, it may be sufficient today. Otherwise, you'll need a more complicated algorithm, which keeps track of how many bits are already used in the last byte, and does the appropriate shifts and masks to insert as many bits as possible in one go: at most two shift-and-or operations per insertion, rather than the five single-bit operations needed here. This is what I ended up using. (But I don't have the code anymore, so I can't easily post an example.)
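
For illustration only, here is one way the "several bits in one go" idea could look. It is not the code referred to above: it is a simplified variant that collects bits in an integer accumulator and flushes whole bytes, and the names `BitWriter` and `packLetters` are invented for this sketch. Each byte is filled from its least significant bit upwards.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical class: appends fixed-width codes to a byte buffer.
class BitWriter
{
    std::vector<unsigned char> myBytes;
    std::uint32_t myAccumulator = 0;  // bits not yet flushed to myBytes
    int myBitCount = 0;               // number of valid bits in myAccumulator
public:
    void put( std::uint32_t value, int width )
    {
        assert( width > 0 && width <= 16 );
        assert( value < (std::uint32_t( 1 ) << width) );
        // One shift and one or insert the whole code.
        myAccumulator |= value << myBitCount;
        myBitCount += width;
        // Move every completed byte into the buffer.
        while ( myBitCount >= 8 ) {
            myBytes.push_back( static_cast<unsigned char>( myAccumulator & 0xFFu ) );
            myAccumulator >>= 8;
            myBitCount -= 8;
        }
    }
    // Flushes a partial last byte (padded with zero bits) and returns the data.
    std::vector<unsigned char> finish()
    {
        if ( myBitCount > 0 ) {
            myBytes.push_back( static_cast<unsigned char>( myAccumulator & 0xFFu ) );
            myAccumulator = 0;
            myBitCount = 0;
        }
        return myBytes;
    }
};

// Packs a string of upper-case letters, five bits per letter.
inline std::vector<unsigned char> packLetters( const std::string& s )
{
    BitWriter writer;
    for ( char ch : s ) {
        assert( ch >= 'A' && ch <= 'Z' );
        writer.put( static_cast<std::uint32_t>( ch - 'A' ), 5 );
    }
    return writer.finish();
}
```

With this layout, `packLetters( "AB" )` yields the two bytes `0x20 0x00`: the code for `A` (0) sits in bits 0 to 4 of the first byte, and the code for `B` (1) starts at bit 5.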

Remy Lebeau · Answer 2 · 2014-04-30T18:29:30.987

0

The smallest unit of data that you can read from or write to a file is a byte (8 bits). You will have to use bit shifts, and since you can only read/write data in groups of 8 bits, you need extra logic to handle that. If your input has at least 8 5-bit letters, merge 8 letters at a time to make a total of 40 bits and write that out to the file as 5 8-bit bytes. Continue until you have fewer than 8 letters left, then merge those, pad the result up to a multiple of 8 bits, and write that out to the file.
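
A minimal sketch of that procedure, under a few assumptions not stated in the answer: letters are encoded as `ch - 'A'`, the first letter of a group goes into the lowest bits, and the padding bits are zero. The names `writePacked` and `packToString` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes the letters to out, eight letters (40 bits) per five bytes.
// A shorter final group is padded with zero bits to a whole number of bytes.
inline void writePacked( std::ostream& out, const std::vector<char>& letters )
{
    for ( std::size_t start = 0; start < letters.size(); start += 8 ) {
        const std::size_t count =
            std::min<std::size_t>( 8, letters.size() - start );
        std::uint64_t group = 0;
        for ( std::size_t i = 0; i < count; ++i ) {
            const char ch = letters[start + i];
            assert( ch >= 'A' && ch <= 'Z' );
            group |= static_cast<std::uint64_t>( ch - 'A' ) << (5 * i);
        }
        const std::size_t byteCount = (5 * count + 7) / 8;  // round up
        for ( std::size_t b = 0; b < byteCount; ++b ) {
            out.put( static_cast<char>( (group >> (8 * b)) & 0xFFu ) );
        }
    }
}

// Convenience wrapper for in-memory use.
inline std::string packToString( const std::vector<char>& letters )
{
    std::ostringstream out;
    writePacked( out, letters );
    return out.str();
}
```

To write to a file, pass a `std::ofstream` opened with `std::ios::binary` as `out`. Because a zero code also means `A`, zero padding can be ambiguous for some lengths (five letters leave seven padding bits, enough for one spurious `A`), so a reader would need the letter count stored separately or a reserved padding code.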

edited Apr 30 '14 at 18:29

answered Apr 30 '14 at 17:54

Remy Lebeau

So would a solution be to merge a bunch of 5 bit bitsets, and pad the result so the size is a multiple of 8 and finally write 8 bits at a time to the output? – Craig Apr 30 '14 at 18:02
Yes, that is what you will have to do. – Remy Lebeau Apr 30 '14 at 18:14
@RemyLebeau I'm not sure that you could use `std::bitset`, or at least that it would buy you anything. `std::bitset` must have a constant length, so you cannot simply concatenate a sequence of `std::bitset<5>`. (If you know the number of characters in advance, of course, you could create a `std::bitset<5 * numChars>`, and work with that. But that only works if the number of characters is a compile time constant.) – James Kanze May 01 '14 at 08:21

Sergey Kalinichenko · Answer 3 · 2014-04-30T18:29:21.823

0

I have a vector [of chars that] can only be the 26 upper-case letters of the alphabet

You can code it up relatively easily: split the text into eight-character blocks, and write the encoded text into five-byte blocks, like this:

          76543210 76543210 76543210 76543210 76543210 76543210 76543210 76543210
ORIGINAL: 000AAAAA 000BBBBB 000CCCCC 000DDDDD 000EEEEE 000FFFFF 000GGGGG 000HHHHH

          76543210 76543210 76543210 76543210 76543210
ENCODED:  AAAAABBB BBCCCCCD DDDDEEEE EFFFFFGG GGGHHHHH

If you do not have enough characters for your last block, use a "pad" character (all ones) which is not used for encoding any of the 26 letters.
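
The layout in the diagram, including the all-ones pad, could be sketched as follows. This assumes letters are encoded as `ch - 'A'` (so 31 is never a letter code); `kPad`, `pack` and `unpack` are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const unsigned kPad = 31;  // all ones; not the code of any letter (0..25)

// Packs eight letters into five bytes; the first letter of each block
// occupies the top five bits of the block's first byte.
inline std::vector<unsigned char> pack( const std::string& text )
{
    std::vector<unsigned char> bytes;
    for ( std::size_t start = 0; start < text.size(); start += 8 ) {
        std::uint64_t block = 0;
        for ( std::size_t i = 0; i < 8; ++i ) {
            unsigned code = kPad;  // used when the last block is short
            if ( start + i < text.size() ) {
                const char ch = text[start + i];
                assert( ch >= 'A' && ch <= 'Z' );
                code = static_cast<unsigned>( ch - 'A' );
            }
            block = (block << 5) | code;
        }
        for ( int shift = 32; shift >= 0; shift -= 8 ) {
            bytes.push_back( static_cast<unsigned char>( (block >> shift) & 0xFFu ) );
        }
    }
    return bytes;
}

// Reverses pack(); stops at the first pad code.
inline std::string unpack( const std::vector<unsigned char>& bytes )
{
    assert( bytes.size() % 5 == 0 );
    std::string text;
    for ( std::size_t start = 0; start < bytes.size(); start += 5 ) {
        std::uint64_t block = 0;
        for ( std::size_t b = 0; b < 5; ++b ) {
            block = (block << 8) | bytes[start + b];
        }
        for ( int shift = 35; shift >= 0; shift -= 5 ) {
            const unsigned code = static_cast<unsigned>( (block >> shift) & 0x1Fu );
            if ( code == kPad ) {
                return text;  // padding marks the end of the data
            }
            text += static_cast<char>( 'A' + code );
        }
    }
    return text;
}
```

Since every block is exactly five bytes and the pad code is reserved, the reader does not need to know the letter count in advance.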

edited Apr 30 '14 at 18:29

answered Apr 30 '14 at 18:14

Sergey Kalinichenko

That doesn't sound at all like what he is trying to do. – James Kanze Apr 30 '14 at 18:17
Base-32 operates on 5bit groups. What Craig is doing is not exactly Base-32, but ideas from the Base-32 algorithm can be adapted to what Craig needs. The downside is that Base-32 can increase the output size. 5 input bits takes up 8 output bits, up to 40 input bits for 40 output bits, so if the input is not evenly divisible by 40 bits then the output will be slightly more bits than the input due to padding. – Remy Lebeau Apr 30 '14 at 18:18
while this is not exactly what I need, the theory of this and @RemyLebeau to group it into 8 bits has helped. Thanks – Craig Apr 30 '14 at 18:24
@dasblinkenlight: No, the purpose of Base-32 is to convert arbitrary 8bit data into a 7bit compatible format using 5bit indexes into a table of 7bit compatible characters. Craig's data is not 8bit, it is 5bit. So he can use his existing 5bit data as-is for the indexes. It sounds more like Craig just wants to write the 5bit data as-is in an 8bit storage medium, and that is not what Base-32 is for. – Remy Lebeau Apr 30 '14 at 18:25

score 0 · Answer 4 · answered Apr 30 '14 at 18:25

Can you do it? Sure.

I think you'd have more success and ease just using gzip to write a compressed file.

answered Apr 30 '14 at 18:25

patros

score 0 · Answer 5 · answered May 08 '14 at 12:58

You can give my PackedArray code a try.

It implements a random access container where items are packed at the bit level. In other words, it acts as if you were able to manipulate an array of, e.g., uint9_t or uint17_t:

PackedArray principle:
  . compact storage of <= 32 bits items
  . items are tightly packed into a buffer of uint32_t integers

PackedArray requirements:
  . you must know in advance how many bits are needed to hold a single item
  . you must know in advance how many items you want to store
  . when packing, behavior is undefined if items have more than bitsPerItem bits

PackedArray general in memory representation:
  |-------------------------------------------------- - - -
  |       b0       |       b1       |       b2       |
  |-------------------------------------------------- - - -
  | i0 | i1 | i2 | i3 | i4 | i5 | i6 | i7 | i8 | i9 |
  |-------------------------------------------------- - - -

  . items are tightly packed together
  . several items end up inside the same buffer cell, e.g. i0, i1, i2
  . some items span two buffer cells, e.g. i3, i6
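
To make the principle concrete, here is a small independent sketch of bit-level packing into `uint32_t` cells. It is not the PackedArray library's interface: the class name `TinyPackedArray` and its members are hypothetical, and the implementation is only one simple way to handle items that span two cells.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical class illustrating the principle only.
class TinyPackedArray
{
    std::vector<std::uint32_t> myCells;
    unsigned myBitsPerItem;
    std::size_t myCount;

    std::uint64_t mask() const
    {
        return (std::uint64_t( 1 ) << myBitsPerItem) - 1;
    }
    // Two neighbouring cells viewed as one 64-bit window.
    std::uint64_t window( std::size_t cell ) const
    {
        return myCells[cell]
            | (static_cast<std::uint64_t>( myCells[cell + 1] ) << 32);
    }
public:
    // One spare cell is allocated so that cell + 1 always exists.
    TinyPackedArray( unsigned bitsPerItem, std::size_t count )
        : myCells( (bitsPerItem * count + 31) / 32 + 1 )
        , myBitsPerItem( bitsPerItem )
        , myCount( count )
    {
        assert( bitsPerItem >= 1 && bitsPerItem <= 32 );
    }
    void set( std::size_t index, std::uint32_t value )
    {
        assert( index < myCount );
        assert( value <= mask() );
        const std::size_t bit = index * myBitsPerItem;
        const std::size_t cell = bit / 32;
        const unsigned offset = static_cast<unsigned>( bit % 32 );
        // Clear the item's old bits, then insert the new value.
        const std::uint64_t w = (window( cell ) & ~(mask() << offset))
            | (static_cast<std::uint64_t>( value ) << offset);
        myCells[cell] = static_cast<std::uint32_t>( w );
        myCells[cell + 1] = static_cast<std::uint32_t>( w >> 32 );
    }
    std::uint32_t get( std::size_t index ) const
    {
        assert( index < myCount );
        const std::size_t bit = index * myBitsPerItem;
        const unsigned offset = static_cast<unsigned>( bit % 32 );
        return static_cast<std::uint32_t>( (window( bit / 32 ) >> offset) & mask() );
    }
};
```

For the question's case, `TinyPackedArray letters( 5, n )` would hold `n` letter codes; with 5-bit items, item 6 occupies bits 30 to 34 and therefore spans the first two cells.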
