Reading dynamic_bitset data back from a file does not return the data that was written

Question

I have a vector that holds three numbers: 65, 66, and 67. I convert each number from int to binary and append the result to a string, so the string becomes 100000110000101000011 (65, 66, and 67 respectively). I write this data to a file using Boost's dynamic_bitset library. My BitOperations class does the reading from and writing to the file. When I read the data back from the file, instead of the bits above I get 001100010100001000001.

Here is my BitOperations class:

#include <iostream>
#include <boost/dynamic_bitset.hpp>
#include <fstream>
#include <streambuf>
#include "Utility.h"
using namespace std;
using namespace boost;

template <typename T>
class BitOperations {
private:
    T data;
    int size;
    dynamic_bitset<unsigned char> Bits;
    string fName;
    int bitSize;

public:
    BitOperations(dynamic_bitset<unsigned char> b){
        Bits = b;
        size = b.size();
    }

    BitOperations(dynamic_bitset<unsigned char> b, string fName){
        Bits = b;
        this->fName = fName;
        size = b.size();
    }

    BitOperations(T data, string fName, int bitSize){
        this->data = data;
        this->fName = fName;
        this->bitSize = bitSize;
    }

    BitOperations(int bitSize, string fName){
        this->bitSize = bitSize;
        this->fName = fName;
    }

    void writeToFile(){
        if (data != ""){
            vector<int> bitTemp = extractIntegersFromBin(data);
            for (int i = 0; i < bitTemp.size(); i++){
                Bits.push_back(bitTemp[i]);
            }
        }
        ofstream output(fName, ios::binary| ios::app);
        ostream_iterator<char> osit(output);
        to_block_range(Bits, osit);
        cout << "File Successfully modified" << endl;
    }

    dynamic_bitset<unsigned char> readFromFile(){
        ifstream input(fName);
        stringstream strStream;
        strStream << input.rdbuf();
        T str = strStream.str();

        dynamic_bitset<unsigned char> b;
        for (int i = 0; i < str.length(); i++){
            for (int j = 0; j < bitSize; ++j){
                bool isSet = str[i] & (1 << j);
                b.push_back(isSet);
            }
        }
        return b;
    }
};

And here is the code which calls these operations:

#include <iostream>
// #include <string.h>
#include <boost/dynamic_bitset.hpp>
#include "Utility/BitOps.h"

int main(){
    vector<int> v;
    v.push_back(65);
    v.push_back(66);
    v.push_back(67);

    stringstream ss;
    string st;
    for (int i = 0; i < v.size(); i++){
        ss = toBinary(v[i]);
        st += ss.str().c_str();
        cout << i << " )" << st << endl;
    }
    // reverse(st.begin(), st.end());
    cout << "Original: " << st << endl;

    BitOperations<string> b(st, "bits2.bin", 7);
    b.writeToFile();
    BitOperations<string>c(7, "bits2.bin");
    boost::dynamic_bitset<unsigned char> bits;
    bits = c.readFromFile();
    string s;
    
    // for (int i = 0; i < 16; i++){
        to_string(bits, s);
        // reverse(s.begin(), s.end());
    // }
    cout << "Decompressed: " << s << endl;
}

What am I doing wrong that results in this incorrect behaviour?

EDIT: Here is the extractIntegersFromBin(string s) function.

vector<int> extractIntegersFromBin(string s){

    char tmp;
    vector<int> nums;

    for (int i = 0; s[i]; i++ ){
        nums.push_back(s[i] - '0');
    }

    return nums;
}

Edit 2: Here is the code for toBinary:

stringstream toBinary(int n){
    vector<int> bin, bin2;
    int i = 0;
    while (n > 0){
        bin.push_back(n % 2);
        n /= 2;
        i++;
    }

    // for (int j = i-1; j >= 0; j--){
    //     bin2.push_back(bin[j]);
    // }
    reverse(bin.begin(), bin.end());
    stringstream s;
    for (int i = 0; i < bin.size(); i++){
        s << bin[i];
    }

    return s;
}

What is the definition of `toBinary`? Please provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) of the problem you are encountering. — f9c69e9781fa194211448473495534, Jan 04 '21 at 21:45
Same applies for `extractIntegersFromBin` and `to_block_range`. It is very difficult to help you if you don't show the code that is actually causing your problems. — f9c69e9781fa194211448473495534, Jan 04 '21 at 22:26
@f9c69e9781fa194211448473495534 `extractIntegersFromBin` extracts the binary bits from a string and `toBinary` converts an integer into binary. `to_block_range` is a boost built in function. — Muhammad Iqbal, Jan 05 '21 at 02:52
Ok, I see now you are `using namespace boost;` to import `to_block_range` into the global namespace. I still don't think the question can be answered effectively without seeing the definition of `extractIntegersFromBin`. Otherwise we can only guess if this method may or may not return correct results. Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). — f9c69e9781fa194211448473495534, Jan 05 '21 at 09:59
@f9c69e9781fa194211448473495534 I have added the `extractIntegersFromBin` function. check the Edit. — Muhammad Iqbal, Jan 05 '21 at 14:30

f9c69e9781fa194211448473495534 · Accepted Answer · 2021-01-05T19:25:38.943

You are facing two different issues:

1. The Boost function to_block_range pads the output to the internal block size by appending zeros at the end. In your case, the internal block size is sizeof(unsigned char)*8 == 8. So if the length of the bit sequence you write to the file in writeToFile is not a multiple of 8, additional 0s are written to pad it to a multiple of 8. When you read the bit sequence back in with readFromFile, you have to find some way to remove the padding bits again.
2. There is no standard way to represent a bit sequence (reference). Depending on the scenario, it might be more convenient to represent the bits left-to-right or right-to-left (or in some completely different order). For this reason, when you use different pieces of code to print the same bit sequence and you want them to print the same result, you have to make sure that they agree on how to represent the bit sequence. If one piece of code prints left-to-right and the other right-to-left, you will get different results.

Let's discuss each issue individually:

Regarding issue 1

I understand that you want to define your own block size with the bitSize variable, on top of the internal block size of boost::dynamic_bitset. For example, in your main method, you construct BitOperations<string> c(7, "bits2.bin");. I understand that to mean that you expect the bit sequence stored in the file to have a length that is some multiple of 7.

If this understanding is correct, you can remove the padding bits that have been inserted by to_block_range by reading the file size and then rounding it down to the nearest multiple of your block size. Note, however, that you currently do not enforce this contract in the BitOperations constructor or in writeToFile (e.g. by ensuring that the data size is a multiple of 7).

In your readFromFile method, first note that the inner loop incorrectly takes bitSize into account. If bitSize is 7, it only considers the first 7 bits of each byte, whereas the blocks written by to_block_range use all 8 bits of each 1-byte block, since boost::dynamic_bitset knows nothing about your 7-bit block size. As a result, you miss some bits.
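To make the missed bits concrete, here is a minimal sketch that does not use Boost. The helpers `packBits` and `unpackLowBits` are hypothetical names introduced only for this illustration; `packBits` emulates 8-bit blocks written least significant bit first with zero padding, and `unpackLowBits` mimics the inner loop of the original readFromFile.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Packs a '0'/'1' string into bytes, least significant bit first,
// zero-padding the last byte (an emulation of 8-bit blocks, not Boost itself).
std::vector<unsigned char> packBits(const std::string& bits) {
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1') {
            bytes[i / 8] |= static_cast<unsigned char>(1u << (i % 8));
        }
    }
    return bytes;
}

// Reads only the lowest `bitsPerByte` bits of every byte, as the inner loop
// of the original readFromFile does when bitsPerByte == bitSize.
std::string unpackLowBits(const std::vector<unsigned char>& bytes, int bitsPerByte) {
    std::string out;
    for (unsigned char byte : bytes) {
        for (int j = 0; j < bitsPerByte; ++j) {
            out += ((byte >> j) & 1u) ? '1' : '0';
        }
    }
    return out;
}
```

For the 21-bit input from the question, reading 7 bits per byte gives 100000100001010001100. Printed in reverse order, that is the 001100010100001000001 reported in the question. Reading all 8 bits per byte returns the original 21 bits followed by 3 padding zeros.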

Here is one example for how to fix your code:

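    const size_t bitsPerByte = 8;
    // Round the number of stored bits down to a multiple of bitSize
    const size_t bitCount = (str.length() * bitsPerByte) / bitSize * bitSize;

    for (size_t i = 0; i < bitCount; i++) {
      const size_t index = i / bitsPerByte;   // byte that holds bit i
      const size_t offset = i % bitsPerByte;  // position of bit i within that byte

      const bool isSet = str[index] & (1 << offset);
      b.push_back(isSet);
    }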

This example first calculates how many bits should be read in total, by rounding down the file size to the nearest multiple of your block size. It then iterates over the full bytes in the input (i.e. the internal blocks that were written by boost::dynamic_bitset), until the targeted number of bits have been read. The remaining padding bits are discarded.
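The rounding arithmetic can be checked in isolation. This is only a sketch under the assumption of 8-bit blocks; `bytesWritten` and `recoveredBitCount` are hypothetical helper names, not part of Boost or of the code in the question.

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes written for `bitCount` bits when blocks are 8 bits wide.
std::size_t bytesWritten(std::size_t bitCount) {
    return (bitCount + 7) / 8;
}

// Bit count recovered by rounding the file size (in bits) down to a
// multiple of bitSize.
std::size_t recoveredBitCount(std::size_t fileBytes, std::size_t bitSize) {
    assert(bitSize > 0);
    return (fileBytes * 8) / bitSize * bitSize;
}
```

With three 7-bit values, 21 bits are written as 3 bytes and `recoveredBitCount(3, 7)` gives 21 again. The rounding is only unambiguous while the padding is shorter than bitSize, though. Seven 7-bit values take 49 bits, which are written as 7 bytes, and `recoveredBitCount(7, 7)` gives 56, i.e. one extra all-zero value. If that case can occur, one possible approach is to store the exact bit count separately, for example in a small header at the start of the file.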

An alternative method would be to use boost::from_block_range. This allows you to get rid of some boilerplate code (i.e. reading the input into some string buffer):

  dynamic_bitset<unsigned char> readFromFile() {
    // Open in binary mode so that no byte is translated on the way in
    ifstream input{fName, ios::binary};

    // Get file size
    input.seekg(0, ios_base::end);
    const streamoff fileSize = input.tellg();

    // TODO Handle error: fileSize < 0

    // Reset to beginning of file
    input.clear();
    input.seekg(0);

    // Create bitset with desired size
    const size_t bitsPerByte = 8;
    const size_t bitCount = (static_cast<size_t>(fileSize) * bitsPerByte) / bitSize * bitSize;
    dynamic_bitset<unsigned char> b(bitCount);

    // TODO Handle error: fileSize != b.num_blocks() * b.bits_per_block / bitsPerByte

    // Read file into bitset. istreambuf_iterator reads raw bytes, whereas
    // istream_iterator<char> would skip bytes that look like whitespace.
    std::istreambuf_iterator<char> iter{input};
    boost::from_block_range(iter, {}, b);

    return b;
  }
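One caveat when feeding raw bytes through stream iterators: `std::istream_iterator<char>` uses formatted extraction and skips whitespace by default, whereas `std::istreambuf_iterator<char>` reads every byte. The following sketch shows the difference with an in-memory stream; `countFormatted` and `countRaw` are hypothetical helper names for this illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>

// Counts the bytes seen through std::istream_iterator<char> (formatted input).
std::size_t countFormatted(const std::string& data) {
    std::istringstream in(data);
    return static_cast<std::size_t>(std::distance(
        std::istream_iterator<char>(in), std::istream_iterator<char>()));
}

// Counts the bytes seen through std::istreambuf_iterator<char> (raw input).
std::size_t countRaw(const std::string& data) {
    std::istringstream in(data);
    return static_cast<std::size_t>(std::distance(
        std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()));
}
```

For the three bytes 115, 32, 36 (the string "s $"), the formatted iterator sees only two bytes because the byte 32 is a space. In a bit stream, any dropped byte shifts all bits after it, so binary data is safer to read with the raw iterator.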

Regarding issue 2

Once you have solved issue 1, the boost::dynamic_bitset that is written to the file by writeToFile will be the same as the one read by readFromFile. If you print both with the same method, the output will match. However, if you use different methods for printing, and these methods do not agree on the order in which to print the bits, you will get different results.

For example, in the output of your program you can now see that the "Original:" output is the same as "Decompressed:", except in reverse order:

Original: 100000110000101000011
...
Decompressed: 110000101000011000001

Again, this does not mean that readFromFile is working incorrectly, only that you are using different ways of printing the bit sequences.

The output for Original: is obtained by directly printing the 0/1 input string in main from left to right. In writeToFile, this string is then decomposed in the same order with extractIntegersFromBin and each bit is passed to the push_back method of boost::dynamic_bitset. The push_back method appends to the end of the bit sequence, meaning it will interpret each bit you pass as more significant than the previous (reference):

Effects: Increases the size of the bitset by one, and sets the value of the new most-significant bit to value.

Therefore, your input string is interpreted such that the first bit in the input string is the least significant bit (i.e. the "first" bit of the sequence), and the last bit of the input string is the most significant bit (i.e. the "last" bit of the sequence).

In contrast, you construct the output for "Decompressed:" with to_string. From the documentation of this method, we can see that the least-significant bit of the bit sequence will be the last character of the output string (reference):

Effects: Copies a representation of b into the string s. A character in the string is '1' if the corresponding bit is set, and '0' if it is not. Character position i in the string corresponds to bit position b.size() - 1 - i.

So the problem is simply that to_string (by design) prints in the opposite order compared to the order in which you print the input string manually. To fix this, you have to reverse one of the two, e.g. by printing the input string in reverse order, or by reversing the output of to_string.
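The two orderings can be reproduced without Boost. This sketch uses `std::vector<bool>` as a stand-in for the bitset; `pushBits` and `msbFirstString` are hypothetical names, and `msbFirstString` only follows the character-to-bit mapping quoted above rather than calling Boost's to_string.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Appends each character as a new highest-index bit, like repeated push_back.
std::vector<bool> pushBits(const std::string& text) {
    std::vector<bool> bits;
    for (char c : text) {
        bits.push_back(c == '1');
    }
    return bits;
}

// Character i of the result shows bit (size - 1 - i), which is the mapping
// the quoted documentation describes for to_string.
std::string msbFirstString(const std::vector<bool>& bits) {
    std::string s(bits.size(), '0');
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[bits.size() - 1 - i]) {
            s[i] = '1';
        }
    }
    return s;
}
```

Pushing 100000110000101000011 and printing it this way gives 110000101000011000001, which is the "Decompressed:" line shown above. Reversing that string restores the original.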

This works like a charm, but one problem arises. When I push `115 115 108 36 98 113 84 32 32 104 97 105 139` into the vector, it does not decompress correctly. The max here is 139, which takes 8 bits. When I decompress the file, 115 115 108 are printed correctly, but instead of 36 it prints 73. If `bitSize` is 8, how do I write the numbers so that 36 is also written to the file as `00100100` (i.e. every number is written with exactly `bitSize` bits)? — Muhammad Iqbal, Jan 05 '21 at 20:36
Does the raw bit sequence match for those numbers? I tried it here (with my own implementation of `toBinary`) and the bit sequence that is written to the file matches the one that is read from the file. If that is also the case on your side, then either there is some issue with how you do the decompression (I think it's not shown in the code you have posted so far?), or there is some issue with `toBinary` (I think you also haven't posted the code for that, so I cannot rule that out). — f9c69e9781fa194211448473495534, Jan 05 '21 at 20:51
Hm, the logic of `toBinary` itself looks ok. But I understand it returns a variable-length encoding of the number (i.e. as many bits as are needed to represent the number), as opposed to a fixed-length encoding where every number is encoded with the same number of bits? I was under the impression that you wanted to do fixed-length encoding (hence my assumption that the length of the bit sequence is some multiple of `bitSize`). — f9c69e9781fa194211448473495534, Jan 05 '21 at 21:01
If you want to do a variable-length encoding, it is not clear to me how you want to identify where a number ends in the bit sequence. E.g., if you have the bit sequence "1010", is that a sequence of two numbers (two, two) or one number (ten)? — f9c69e9781fa194211448473495534, Jan 05 '21 at 21:02
I will be passing the `bitSize` of Max(digits) from the vector. So every other number should be of bitSize when converted into binary. — Muhammad Iqbal, Jan 05 '21 at 21:05
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226896/discussion-between-mib-and-f9c69e9781fa194211448473495534). — Muhammad Iqbal, Jan 05 '21 at 21:22
There is a problem in your readFromFile() function: it appends extra 0s at the end of the output. I guess it is the bitCount. How can I tweak bitCount so that, instead of appending extra 0s, it gives me only the original data? — Muhammad Iqbal, Jan 13 '21 at 18:36
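For the fixed-length encoding discussed in the comments above, every number has to be padded with leading zeros to the same width. A minimal sketch, assuming the width is passed in explicitly; `toFixedBinary` is a hypothetical replacement for `toBinary`, not code from the question:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encodes `value` as exactly `width` binary digits, most significant first.
std::string toFixedBinary(unsigned value, int width) {
    assert(width > 0 && width <= 32);
    std::string s(static_cast<std::size_t>(width), '0');
    for (int i = width - 1; i >= 0; --i) {
        if (value & 1u) {
            s[static_cast<std::size_t>(i)] = '1';
        }
        value >>= 1;
    }
    assert(value == 0);  // value must fit into `width` bits
    return s;
}
```

With a width of 8, 36 becomes 00100100 and 139 becomes 10001011, so every number occupies the same number of bits and the reader can split the sequence at fixed offsets.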
