Let's say I have read the binary content of a text file into a std::vector<std::uint8_t>
and I want to transform these bytes into a string representation.
As long as the bytes are encoded using a single-byte encoding (ASCII for example), a transformation to std::string
is pretty straightforward:
std::string transformToString(const std::vector<std::uint8_t>& bytes)
{
    std::string str;
    // char and std::uint8_t are both one byte wide, so the byte count is the character count.
    str.assign(reinterpret_cast<const char*>(bytes.data()), bytes.size());
    return str;
}
As soon as the bytes are encoded in some Unicode format, things get a little more complicated.
As far as I know, C++ supports additional string types for Unicode strings: std::u8string for UTF-8, std::u16string for UTF-16, and std::u32string for UTF-32.
Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I determine the length of the string, given that a single code point can be encoded in multiple bytes?
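To make Problem 1 concrete, here is a minimal sketch of what I have in mind. It assumes that copying the bytes unchanged is acceptable and does no validation. The names Utf8String, toUtf8String and countCodePoints are hypothetical, not standard library names. std::u8string needs C++20, so the alias falls back to std::string on older compilers. The code point count relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical alias: std::u8string where the library provides it (C++20),
// otherwise plain std::string holding the same UTF-8 bytes.
#if defined(__cpp_lib_char8_t)
using Utf8String = std::u8string;
#else
using Utf8String = std::string;
#endif

// Copies the bytes unchanged. No validation: the caller asserts that the
// input really is UTF-8.
Utf8String toUtf8String(const std::vector<std::uint8_t>& bytes)
{
    // Each std::uint8_t converts to the string's character type by an
    // ordinary integral conversion, so no reinterpret_cast is needed.
    return Utf8String(bytes.begin(), bytes.end());
}

// Counts code points by counting every byte that is NOT a continuation
// byte (continuation bytes look like 10xxxxxx). Assumes well-formed UTF-8.
std::size_t countCodePoints(const Utf8String& s)
{
    std::size_t count = 0;
    for (auto c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Note that size() on the resulting string still reports code units (bytes), not code points. For the six bytes 41 C3 A9 E2 82 AC, size() is 6 while the code point count is 3.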
Problem 2: I've seen that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for the transformation)? I am looking for something like this: std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);
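As far as I can tell, std::u16string has no from_bytes member, so the closest thing I can picture is a hand-written helper. The following is only a sketch under that assumption. ByteOrder and toU16String are hypothetical names, and the function merely assembles 16-bit code units from byte pairs without checking surrogate pairs.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical enum naming the byte order of the input bytes.
enum class ByteOrder { BigEndian, LittleEndian };

// Combines each pair of bytes into one char16_t code unit, using the byte
// order given by the caller. Surrogate pairs are not validated.
std::u16string toU16String(const std::vector<std::uint8_t>& bytes, ByteOrder order)
{
    if (bytes.size() % 2 != 0) {
        throw std::invalid_argument("UTF-16 data must contain an even number of bytes");
    }

    std::u16string result;
    result.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned first = bytes[i];
        const unsigned second = bytes[i + 1];
        // Big-endian: the first byte is the high byte. Little-endian: the reverse.
        const unsigned unit = (order == ByteOrder::BigEndian)
            ? ((first << 8) | second)
            : ((second << 8) | first);
        result.push_back(static_cast<char16_t>(unit));
    }
    return result;
}
```

Building each code unit with shifts keeps the result independent of the byte order of the machine running the code, which a reinterpret_cast of the buffer would not. An odd number of bytes is rejected because it cannot form whole 16-bit code units.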
Problem 3: Are the previously listed Unicode string types already aware of a byte order mark, or does the byte order mark (if present) need to be processed separately? Since these string types are just std::basic_string instantiated with char8_t, char16_t and char32_t, I assume that processing of a byte order mark is not supported.
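If that assumption holds, the byte order mark would have to be detected and skipped before the bytes are copied into a string. A rough sketch of such a check follows. Bom, BomInfo and detectBom are hypothetical names, and only the UTF-8 and UTF-16 marks are handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result types for the sketch.
enum class Bom { None, Utf8, Utf16BigEndian, Utf16LittleEndian };

struct BomInfo {
    Bom kind;            // which byte order mark was found, if any
    std::size_t length;  // number of leading bytes to skip
};

// Inspects the first bytes of the buffer for a UTF-8 or UTF-16 byte order mark.
// UTF-32 is deliberately not handled: its little-endian mark (FF FE 00 00)
// starts with the same two bytes as the UTF-16 LE mark, so the caller still
// has to know which encoding family to expect.
BomInfo detectBom(const std::vector<std::uint8_t>& bytes)
{
    if (bytes.size() >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        return {Bom::Utf8, 3};
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
        return {Bom::Utf16BigEndian, 2};
    }
    if (bytes.size() >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        return {Bom::Utf16LittleEndian, 2};
    }
    return {Bom::None, 0};
}
```

The returned length could then be used as the offset at which copying into the string starts. Without such a step, the mark would simply end up in the string as an ordinary U+FEFF character.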
Clarification: Please note that I do not want to do any conversions. Almost every article I found was about how to convert UTF-8 strings to other encodings and vice versa. I just want to get the string representation of the specified byte array. Therefore, as the user/programmer, I must know the encoding of the bytes to get the correct representation. For example:
- The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u8string. The transformation is correct; the string is ABC.
- The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u16string. The transformation fails or the resulting string is not correct.