Transform byte array to string while supporting different encodings

Question

Let's say I have read the binary content of a text file into a std::vector<std::uint8_t> and I want to transform these bytes into a string representation.

As long as the bytes are encoded using a single-byte encoding (ASCII for example), a transformation to std::string is pretty straightforward:

std::string transformToString(std::vector<std::uint8_t> bytes)
{
  std::string str;
  
  str.assign(
    reinterpret_cast<std::string::value_type*>(const_cast<std::uint8_t*>(bytes.data())),
    bytes.size() / sizeof(std::string::value_type)
  );

  return str;
}

As soon as the bytes are encoded in some unicode format, things get a little bit more complicated.

As far as I know, C++ supports additional string types for unicode strings. These are std::u8string for UTF-8, std::u16string for UTF-16 and std::u32string for UTF-32.

Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string since there can be code points encoded in multiple bytes?

Problem 2: I've seen, that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.

Problem 3: Are the previously listed types of unicode string already aware of a byte order mark or does the byte order mark (if present) need to be processed separately? Since the said string types are just char8_t, char16_t and char32_t templated on a std::basic_string, I assume, that processing of a byte order mark is not supported.

Clarification: Please note, that I do not want to do any conversions. Almost every article I found was about how to convert UTF-8 strings to other encodings and vice-versa. I just want to get the string representation of the specified byte array. Therefore, as the user/programmer, I must be aware of the encoding of the bytes to get the correct representation. For example:

The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u8string. The transformation was correct, the string is ABC.
The bytes are encoded in UTF-8 (e.g. 41 42 43 (ABC)). I try to transform them to a std::u16string. The transformation fails or the resulting string is not correct.

Personally I would not use the standard library for this and get a dedicated UTF library like [ICU](https://icu.unicode.org/) — NathanOliver, Nov 28 '22 at 17:00
I have heard about ICU but I wanted to clarify, if this is possible using the standard library before considering the use of third party components. — Erik So, Nov 28 '22 at 17:04
It really depends on what you want to do with Unicode. If all you need to do is read a Unicode string in a known encoding, not process it, and output it, then the Standard Containers will do this. For all other processing use ICU. The Standard Library containers do not know about BOMs, normal form, graphemes, multiple whitespace characters etc. — Richard Critten, Nov 28 '22 at 17:16

user17732522 · Answer 1 · 2022-11-28T18:20:21.600

Your transformToString is (more or less) correct only if uint8_t is unsigned char, which however is the case on every platform I know.

It is unnecessary to do the multiple casts you are doing. The whole cast sequence is not an aliasing violation only if you are casting from unsigned char* to char* (and char is always the value type of std::string). In particular there is no const involved. I also say "more or less", because while this is probably supposed to work specifically when casting between signed/unsigned variants of the same element type, the standard currently doesn't actually specify the pointer arithmetic on the resulting pointer (which I guess is a defect).

However there is a much safer way that doesn't involve dangerous casts or potential for length mismatch:

str.assign(std::begin(bytes), std::end(bytes));

You can use exactly the same line as above to convert to any other std::basic_string specialization, but the important point is that it will simply copy individual bytes as individual code units, not considering encoding or endianness in any way.
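As a minimal sketch of that point, the assign call can be wrapped in a small template. The name bytesToCodeUnits is made up for illustration and is not part of the standard library. It copies one byte into one code unit, whatever the target string type is:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a standard function): copies every byte into
// exactly one code unit of the target string type. No decoding and no
// endianness handling takes place.
template <typename String>
String bytesToCodeUnits(const std::vector<std::uint8_t>& bytes)
{
    String str;
    str.assign(bytes.begin(), bytes.end());
    return str;
}
```

With the bytes 41 42 43, bytesToCodeUnits<std::string>(bytes) yields "ABC", and the same holds for std::u8string in C++20. bytesToCodeUnits<std::u16string>(bytes) also yields three code units, one per byte, which is only meaningful if each byte really was meant to be a separate 16-bit code unit.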

Problem 1: Let's say the bytes are encoded in UTF-8. How can I create a std::u8string from these bytes in the first place? Also, how do I know the length of the string since there can be code points encoded in multiple bytes?

You create the string with exactly the same line I showed above. In this case the reinterpret_cast approach from your question would be wrong if you just replaced str's type, because char8_t cannot alias unsigned char, so it would be an aliasing violation resulting in undefined behavior.

A std::u8string holds a sequence of UTF-8 code units (by convention). To get individual code points you can convert to UTF-32. There is std::mbrtoc32 from the C standard library, which relies on the C locale being set as UTF-8 (and requires conversion back to a char array first), and there is codecvt_utf8<char32_t> from the C++ library, which however has been deprecated since C++17, and no replacement has been decided on yet.

There are no functions in the standard library that actually interpret the sequence of code units in u8string as code points. (e.g. .size() is the number of code units, not the number of code points).
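If only the number of code points is needed, one hand-written option is to count the code units that are not UTF-8 continuation bytes (bit pattern 10xxxxxx). The following is a rough sketch under the assumption that the input is well-formed UTF-8. The function name countCodePoints is hypothetical, and no validation is performed:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 code unit sequence.
// Assumes well-formed UTF-8; malformed input gives a meaningless count.
// Works for std::string and, in C++20, std::u8string.
template <typename String>
std::size_t countCodePoints(const String& s)
{
    std::size_t count = 0;
    for (auto codeUnit : s) {
        // Continuation bytes look like 10xxxxxx; every other code unit
        // starts a new code point.
        if ((static_cast<unsigned char>(codeUnit) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the six code units 41 C3 A4 E2 82 AC (the characters A, ä and €), .size() is 6 while this function returns 3.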

Problem 2: I've seen, that UTF-16 and UTF-32 support both big-endian and little-endian byte order. Let's say the bytes are encoded in UTF-16 BE or UTF-16 LE. How can I create a std::u16string from the bytes (and how can I specify the byte order for transformation)? I am looking for something like std::u16string u16str = std::u16string::from_bytes(bytes, byte_order::big_endian);.

There is nothing like that directly in the standard library. A u16string holds 16-bit code units of type char16_t as values. What endianness, or in general what representation, is used for this type is an implementation detail, but you can expect it to match that of the other basic types. Since C++20 there is std::endian to indicate the endianness of all scalar types, if applicable, and since C++23 there is std::byteswap, which can be used to swap byte order if the native endianness doesn't match the source endianness. However, you would need to manually iterate over the vector and form char16_t values from pairs of bytes by bitwise operations anyway, so I am not sure whether this is all that helpful.

All of the above assumes that the original data is actually UTF-16 encoded. If that is not the case you need to convert from the original encoding to UTF-16 for which there are equivalent functions as in the UTF-32 case mentioned above.
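The manual approach of forming char16_t values from pairs of bytes could look roughly like the sketch below. The ByteOrder enum and the function utf16FromBytes are invented names, not standard library facilities. Because the code units are built with shifts, the result does not depend on the endianness of the machine running the code. A BOM, if present, is not treated specially and ends up as an ordinary code unit:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical enum, not part of the standard library.
enum class ByteOrder { BigEndian, LittleEndian };

// Hypothetical helper: combines each pair of bytes into one char16_t code
// unit according to the given source byte order. Does not validate
// surrogate pairs and does not strip a BOM.
std::u16string utf16FromBytes(const std::vector<std::uint8_t>& bytes, ByteOrder order)
{
    if (bytes.size() % 2 != 0) {
        throw std::invalid_argument("UTF-16 data must have an even number of bytes");
    }

    std::u16string result;
    result.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned first = bytes[i];
        const unsigned second = bytes[i + 1];
        // Big-endian: the first byte is the high byte; little-endian: the low byte.
        const unsigned unit = (order == ByteOrder::BigEndian)
            ? ((first << 8) | second)
            : ((second << 8) | first);
        result.push_back(static_cast<char16_t>(unit));
    }
    return result;
}
```

For example, the bytes 00 41 00 42 read as big-endian and the bytes 41 00 42 00 read as little-endian both give the string u"AB".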

Problem 3: Are the previously listed types of unicode string already aware of a byte order mark or does the byte order mark (if present) need to be processed separately? Since the said string types are just char8_t, char16_t and char32_t templated on a std::basic_string, I assume, that processing of a byte order mark is not supported.

The types simply store sequences of code units. They do not care what they represent (e.g. whether they represent a BOM). Because they store code units, not bytes, the BOM wouldn't have any meaning in processing them anyway.
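Consequently, any BOM handling has to happen on the raw bytes before the string is built. A possible sketch is shown below. The names Bom, BomInfo and detectBom are made up for illustration. Only the UTF-8 and UTF-16 marks are checked, so UTF-32 is deliberately left out and a UTF-32 LE mark (FF FE 00 00) would be reported as UTF-16 LE here:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types, not part of the standard library.
enum class Bom { None, Utf8, Utf16BE, Utf16LE };

struct BomInfo {
    Bom kind;            // which mark was found, if any
    std::size_t length;  // number of leading bytes to skip
};

// Hypothetical helper: inspects the first bytes of the buffer for a
// UTF-8 or UTF-16 byte order mark. UTF-32 marks are not handled.
BomInfo detectBom(const std::vector<std::uint8_t>& b)
{
    if (b.size() >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) {
        return {Bom::Utf8, 3};
    }
    if (b.size() >= 2 && b[0] == 0xFE && b[1] == 0xFF) {
        return {Bom::Utf16BE, 2};
    }
    if (b.size() >= 2 && b[0] == 0xFF && b[1] == 0xFE) {
        return {Bom::Utf16LE, 2};
    }
    return {Bom::None, 0};
}
```

The caller can then skip the first length bytes and choose the byte order based on kind before constructing the string.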
