
I have a binary file compressed with gzip which I wish to stream using boost::iostreams. After searching the web for the past few hours, I have found a nice code snippet that does what I want, except for the std::getline part:

try 
{
    std::ifstream file("../data.txt.gz", std::ios_base::in | std::ios_base::binary);
    boost::iostreams::filtering_istream in;
    in.push(boost::iostreams::gzip_decompressor());
    in.push(file);
    std::vector<std::byte> buffer;
    for(std::string str; std::getline(in, str); )
    {
        std::cout << "str length: " << str.length() << '\n';
        for(auto c : str){
            buffer.push_back(std::byte(c));
        }
        std::cout << "buffer size: " << buffer.size() << '\n';
        // process buffer 
        // ...
        // ...
    }
}
catch(const boost::iostreams::gzip_error& e) {
    std::cout << e.what() << '\n';
}

I want to read the file, store it in some intermediate buffer, and fill up the buffer as I stream the file. However, std::getline splits on the \n delimiter and does not include the delimiter in the output string.

Is there a way I could read, for instance, 2048 bytes of data at a time?

  • Have you tried using [`std::copy_n`](https://en.cppreference.com/w/cpp/algorithm/copy_n) with a stream iterator? – Captain Obvlious Jan 05 '23 at 05:31
  • @CaptainObvlious would I replace `std::getline` with `std::copy_n`? I'm not familiar with copying data from a stream iterator. – MoneyBall Jan 05 '23 at 05:44
  • 2
    Actually you may not be able to as the stream iterators use `>>` for input. Either way you may be better off just doing `in.read(..., 2048)` into your vector. – Captain Obvlious Jan 05 '23 at 05:54
  • @CaptainObvlious I'm not too familiar with streams, could you kindly provide an example that uses `in.read`? – MoneyBall Jan 05 '23 at 07:14

1 Answer


Uncompressing the gzip stream the way you want isn't exactly straightforward. One option is to use boost::iostreams::copy to decompress the whole gzip stream into the vector, but since you want to decompress the stream in chunks (the 2k chunks mentioned in your post) that may not be an option.
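For completeness, a minimal sketch of that copy-based approach (decompressing everything in one shot rather than in 2k chunks) might look something like this, assuming the same ../data.txt.gz path as your snippet:

#include <boost/iostreams/copy.hpp>
#include <boost/iostreams/device/back_inserter.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    namespace io = boost::iostreams;

    std::ifstream file("../data.txt.gz", std::ios_base::in | std::ios_base::binary);

    // Set up the gzip stream
    io::filtering_istream in;
    in.push(io::gzip_decompressor());
    in.push(file);

    // Decompress the entire stream into one vector in a single call.
    std::vector<char> buffer;
    io::copy(in, io::back_inserter(buffer));

    std::cout << "buffer size: " << buffer.size() << '\n';
}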

Now, normally with an input stream it's as simple as calling the stream's read() function with a buffer and the number of bytes to read, then calling gcount() to determine how many bytes were actually read. Unfortunately, it seems there is a bug in either filtering_istream or gzip_decompressor, or possibly gcount() is not supported (it should be), because it always seems to return the number of bytes requested instead of the number actually read. As you might imagine, this causes problems when reading the last few bytes of the file unless you know ahead of time how many bytes to expect.
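For context, the usual chunked-read pattern on a well-behaved std::istream is roughly the following (char vectors used here for brevity); it's exactly this reliance on gcount() that breaks down with the filtering stream:

// Typical chunked read on a plain std::istream, relying on gcount().
std::vector<char> buffer, chunk(2048);
while (in.read(chunk.data(), static_cast<std::streamsize>(chunk.size())) || in.gcount() > 0)
{
    auto got = in.gcount();  // bytes actually read in this pass
    buffer.insert(buffer.end(), chunk.begin(), chunk.begin() + got);
}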

Fortunately, the size of the uncompressed data is stored in the last four bytes of the gzip file, so we can account for that; we just have to work a little harder in the decompression loop.

Below is the code I came up with to handle uncompressing the stream the way you would like. It creates two vectors - one for each 2k chunk as it is decompressed and one for the final buffer. It's quite basic and I haven't done anything to optimize memory usage on the vectors, but if that's an issue I suggest switching to a single vector, resizing it to the length of the uncompressed data, and calling read with an offset into the vector's data for each 2k chunk being read (a rough sketch of that variant follows the full example).

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    namespace io = boost::iostreams;

    std::ifstream file("../data.txt.gz", std::ios_base::in | std::ios_base::binary);

    // Get the uncompressed size from the ISIZE field in the last four bytes
    // of the file (stored little endian; assumes the host is little endian too)
    std::uint32_t dataLeft;
    file.seekg(-4, std::ios_base::end);
    file.read(reinterpret_cast<char*>(&dataLeft), sizeof(dataLeft));
    file.seekg(0);

    // Set up the gzip stream
    io::filtering_istream in;
    in.push(io::gzip_decompressor());
    in.push(file);

    std::vector<std::byte> buffer, tmp(2048);
    for (auto toRead(std::min<std::size_t>(tmp.size(), dataLeft));
        dataLeft && in.read(reinterpret_cast<char*>(tmp.data()), toRead);
        dataLeft -= toRead, toRead = std::min<std::size_t>(tmp.size(), dataLeft))
    {
        tmp.resize(toRead);
        buffer.insert(buffer.end(), tmp.begin(), tmp.end());
        std::cout << "buffer size: " << buffer.size() << '\n';
    }
}
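If memory usage is a concern, a rough sketch of that single-vector variant (reusing in and dataLeft from the example above, and with the same caveat about the 32-bit ISIZE field) might look like this:

    // Single-buffer variant: size the vector to the uncompressed length up
    // front and read each 2k chunk directly into its offset.
    std::vector<std::byte> buffer(dataLeft);
    std::size_t offset = 0;
    while (offset < buffer.size())
    {
        auto toRead = std::min<std::size_t>(2048, buffer.size() - offset);
        if (!in.read(reinterpret_cast<char*>(buffer.data() + offset), toRead))
            break;
        offset += toRead;
        // process buffer[0 .. offset) here
        // ...
    }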
  • First off, thank you for taking the time to write the code snippet. A couple of questions: (i) what is `toRead`? (ii) When I run the code, it won't go inside the for loop; I checked `dataLeft`, which is well above 0. – MoneyBall Jan 06 '23 at 00:58
  • `toRead` determines how many bytes to read in that iteration - either the current size of the `tmp` buffer or the number of bytes left (`dataLeft`), whichever is *smaller*. In the example that's the 2048 bytes you wanted to read. If it's not going into the loop, make sure that `tmp` is bigger than 0. – Captain Obvlious Jan 06 '23 at 01:01
  • 1
    interesting, there was an issue with `std::min` because `dataLeft` was not `unsigned long` type. So I changed it to `uint64_t` which was the reason why it didn't go inside the for loop. Switching it back to `uint32_t` and using `static_cast` worked! – MoneyBall Jan 06 '23 at 01:08
  • 1
    Yep. the size of the uncompressed data stored at the end of the file is only 4 bytes long so loading it into a uint64 would definitely bork it up. – Captain Obvlious Jan 06 '23 at 01:17
  • One more quick question: I'm reading a file that is 7GB in size, but `dataLeft` only gives me 197704081, which is off by a lot. Any reason for this? – MoneyBall Jan 06 '23 at 01:49
  • @MoneyBall Yes. Unfortunately, AFAIK gzip only stores a 32-bit size (the full size modulo 2^32), which means you may not be able to work in 2k chunks like you want and may have to drop back to using the `copy` function in Boost. Once I get back to the lab this afternoon I'll fire it up, see what solutions are available for that scenario, and update my answer accordingly. – Captain Obvlious Jan 06 '23 at 15:52