How to use std::string to store bytes (unsigned chars) in a right way?

Question

I'm implementing the LZ77 compression algorithm, and I have trouble storing unsigned chars in a string. To compress any file, I open it in binary mode and read its contents as chars (because 1 char is equal to 1 byte, afaik) into a std::string. Everything works perfectly fine with char. But after some googling I read that char is not always 1 byte, so I decided to swap it for unsigned char. And here things start to get tricky:

When compressing a plain .txt file, everything works as expected: the files before compression and after decompression are identical (I assume they should be, because we are basically working with text before and after the byte conversion).
However, when compressing a .bmp file, the decompressed file is 3 bytes shorter than the input file (I lose these 3 bytes when trying to save unsigned chars to a std::string).

So, my question is – is there a way to properly save unsigned chars to a string?

I tried to use typedef basic_string<unsigned char> ustring and swap all related functions for their basic alternatives to use with unsigned char, but I still lose 3 bytes.

UPDATE: I found out that the 3 bytes (symbols) are lost not because of std::string, but because of std::istream_iterator, which I use instead of std::istreambuf_iterator to create the string of unsigned chars (because std::istreambuf_iterator's template argument has to be char, not unsigned char).

So, are there any solutions to this particular problem?

Example:

std::vector<char> tempbuf(std::istreambuf_iterator<char>(file), {}); // reads 112782 symbols

std::vector<char> tempbuf(std::istream_iterator<char>(file), {}); // reads 112779 symbols

Sample code:

void LZ77::readFileUnpacked(std::string& path)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);

    if (file.is_open())
    {
        // Works just fine with char, but loses 3 bytes with unsigned
        std::string tempstring = std::string(std::istreambuf_iterator<char>(file), {});
        file.close();
    }
    else
        throw std::ios_base::failure("Failed to open the file");
}

@Eljay I read that on some platforms char can not be equal to 1 byte (because apparently Standard doesn't define its exact size). Anyway, what are your thoughts on storing byte data using regular char in general? Is it optimal enough? — asymmetriq, Nov 30 '19 at 16:20
@IgorR. It's the worst case scenario, because it will require to rewrite most of algorithm's logic — asymmetriq, Nov 30 '19 at 16:22
I'd use `std::vector` to store the byte data. If you want to avoid rewriting that much code, then `std::basic_string` (may have to provide your own traits, too). — Eljay, Nov 30 '19 at 16:22
It's not just storing – I need to compress this data, and `std::byte` doesn't provide enough interface (also I read that `std::byte` is `unsigned char` in disguise) — asymmetriq, Nov 30 '19 at 16:24
I think what you're looking for is std::vector<uint8_t>. I'm uncertain whether support for uint8_t is guaranteed by the standard, but if it is available, it will have exactly 8 bits. — Uri Raz, Nov 30 '19 at 16:26
You don't want to use `std::string` - in many places it is assumed that strings are null-terminated and this isn't what you want. Use `std::vector`. There is no need to worry over `char` being not 8 bits. Such systems are rare and most surely something else is not going to work on it besides your small archive code. — ALX23z, Nov 30 '19 at 16:56
@ALX23z Chances are slim that unsigned char would be different from uint8_t, but as a matter of software engineering it's better to explicitly state to the compiler what one is trying to achieve, so on the rare occasion one would get a clear compilation error, rather than an unclear run-time error. — Uri Raz, Nov 30 '19 at 17:11
@asymmetriq "*I read that on some platforms char can not be equal to 1 byte (because apparently Standard doesn't define its exact size)*" - the standard does explicitly say that `char` is one byte in size (that `sizeof(char)` always returns 1). What it does not say is how large a byte is. And yes, there are platforms (albeit rare nowadays) where a byte is not 8 bits in size. See `CHAR_BIT` for the actual size on a given platform — Remy Lebeau, Nov 30 '19 at 17:18

Nicol Bolas · Accepted Answer · 2019-11-30T16:56:35.647

char in all of its forms (and std::byte, which is isomorphic with unsigned char) is always the smallest possible type that a system supports. The C++ standard defines that sizeof(char) and its variations shall always be exactly 1.

"One" what? That's implementation-defined. But every type in the system will be some multiple of sizeof(char) in size.

So you shouldn't be too concerned about systems where char is not 8 bits wide. If you're working on a system where CHAR_BIT isn't 8, then that system can't handle 8-bit bytes directly at all. So unsigned char won't be any different/better for this purpose.
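A minimal sketch of how these guarantees could be checked at compile time. The helper name bits_in is made up for illustration, and the CHAR_BIT == 8 assertion is an assumption about the target platform, not something the standard promises.

```cpp
#include <climits>  // CHAR_BIT
#include <cstddef>  // std::size_t

// sizeof counts in units of char, so these hold on every conforming compiler.
static_assert(sizeof(char) == 1, "guaranteed by the standard");
static_assert(sizeof(signed char) == 1 && sizeof(unsigned char) == 1,
              "guaranteed by the standard");

// The standard only promises at least 8 bits per byte.
static_assert(CHAR_BIT >= 8, "guaranteed by the standard");

// Assumption for this sketch: the code is only meant for 8-bit-byte platforms.
// On any other platform this turns a silent bug into a compile error.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

// Hypothetical helper: width of a type in bits.
template <typename T>
constexpr std::size_t bits_in() { return sizeof(T) * CHAR_BIT; }
```

With the last static_assert in place, plain char and unsigned char are equally suitable as byte storage; the check documents the assumption instead of relying on it silently.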

As to the particulars of your problem, istream_iterator is fundamentally different from istreambuf_iterator. The purpose of the latter is to allow iterator access to the actual stream as a sequence of values. The purpose of istream_iterator<T> is to allow access to a stream as if by performing a repeated sequence of operator>> calls with a T value.

So if you're doing istream_iterator<char>, then you're saying that you want to read the stream as if you did stream >> some_char; for each iterator access. That isn't actually isomorphic with accessing the stream's characters directly. Specifically, FormattedInputFunctions like operator>> can do things like skip whitespace, depending on how you set up your stream.
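The difference can be sketched with an in-memory stream instead of a file. The helper names and the sample bytes below are made up for illustration; the stream is assumed to be in its default state, where skipws is set.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: unformatted read, copies every byte from the stream buffer.
std::vector<char> read_unformatted(std::istream& in) {
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Hypothetical helper: formatted read, behaves like repeated `in >> ch`.
std::vector<char> read_formatted(std::istream& in) {
    return std::vector<char>(std::istream_iterator<char>(in),
                             std::istream_iterator<char>());
}

// Hypothetical helper: still reads chars, but stores them as unsigned char.
std::vector<unsigned char> read_unsigned(std::istream& in) {
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

// Self-check that can be called from main().
void demo_whitespace_loss() {
    const std::string bytes = "a b\nc\t";  // 6 bytes, 3 of them whitespace
    std::istringstream raw(bytes);
    std::istringstream formatted(bytes);
    assert(read_unformatted(raw).size() == 6);     // nothing dropped
    assert(read_formatted(formatted).size() == 3); // whitespace bytes skipped

    const std::string high_byte = "\xFF";
    std::istringstream high(high_byte);
    assert(read_unsigned(high).front() == 0xFF);   // stored as 255
}
```

The last helper shows one way to end up with unsigned char storage without touching formatted input: the iterator's character type stays char, and only the container's element type changes.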

To clarify a bit... Section 4.4 of C++17 (IIRC this is new to this standard) says "The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (5.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, the number of which is implementation defined." So I take it that, starting with C++17, CHAR_BIT must be at least 8. — Uri Raz, Nov 30 '19 at 16:40
@UriRaz -- `CHAR_BIT` has always had to be at least 8 (look up its definition in the library portion of the C standard). But it can be (and sometimes is) more than 8. On some (ancient) systems the smallest addressable unit of storage was 9 bits wide (so `CHAR_BIT` would be 9), and integers were 36 bits. On some modern systems (specifically, DSPs) the smallest addressable unit of storage is 32 bits wide, so `CHAR_BIT` is 32. — Pete Becker, Nov 30 '19 at 16:46

score 1 · Answer 2 · answered Nov 30 '19 at 16:56

istream_iterator reads using operator>>, which by default skips whitespace as part of its operation. If you want to disable that behavior, you'll have to do

#include <ios>

file >> std::noskipws;
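As a rough sketch of how that manipulator could be combined with istream_iterator<unsigned char>: the helper name read_bytes and the sample data are assumptions for illustration, and an in-memory stream stands in for the file.

```cpp
#include <cassert>
#include <ios>       // std::noskipws
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read all remaining bytes through operator>>,
// with whitespace skipping turned off first.
std::vector<unsigned char> read_bytes(std::istream& in) {
    in >> std::noskipws;  // the flag stays set on the stream afterwards
    return std::vector<unsigned char>(std::istream_iterator<unsigned char>(in),
                                      std::istream_iterator<unsigned char>());
}

// Self-check that can be called from main().
void demo_noskipws() {
    // 6 bytes: 0x00, space, 0xFF, newline, space, 'z'
    const std::string data("\x00 \xFF\n\x20z", 6);
    std::istringstream in(data);
    const std::vector<unsigned char> bytes = read_bytes(in);
    assert(bytes.size() == 6);   // whitespace and the null byte are kept
    assert(bytes[2] == 0xFF);
}
```

For a real file, the same helper would be called on an std::ifstream opened with std::ios::binary, as in the question's sample code.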

AProgrammer
