I need to convert between UTF-8, UTF-16 and UTF-32 for different APIs/modules, and since I now have the option to use C++11, I am looking at the new string types.

It looks like I can use string, u16string and u32string for UTF-8, UTF-16 and UTF-32. I also found codecvt_utf8 and codecvt_utf16, which look able to convert between char or char16_t and char32_t, and what looks like a higher-level wstring_convert, but that only appears to work with bytes/std::string and there is not a great deal of documentation.

Am I meant to use a wstring_convert somehow for the UTF-16 ↔ UTF-32 and UTF-8 ↔ UTF-32 case? I only really found examples for UTF-8 to UTF-16, which I am not even sure will be correct on Linux where wchar_t is normally considered UTF-32... Or do something more complex with those codecvt things directly?

Or is this still not really in a usable state, so that I should stick with my own existing small routines using 8-, 16- and 32-bit unsigned integers?

Fire Lancer
  • `wchar_t` is not "considered for UTF-32". `wchar_t` is used for wide characters. You can convert wide characters to UTF-foo if you like. – Kerrek SB Jul 08 '15 at 20:01
  • I would not bet on any C++ unicode functionality - you may try something like uconv: https://en.wikipedia.org/wiki/Uconv –  Jul 08 '15 at 20:03
  • Hence wanting to use the u16*/u32* types. I only mentioned wchar_t because the examples I found on Google seem to use it, and because wstring_convert is standard but u16string_convert, u32string_convert, etc. appear not to exist. So does that mean I missed something about wstring_convert? – Fire Lancer Jul 08 '15 at 20:07
  • Is uconv just a program implemented by ICU? I suppose one extreme option would be to include the ICU library, but that seems a really big thing just to pass a string between a few other libraries with almost no other processing. – Fire Lancer Jul 08 '15 at 20:15

1 Answer

If you read the documentation at CppReference.com for wstring_convert, codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16, the pages include a table that tells you exactly what you can use for the various UTF conversions.

[Table: the standard codecvt facets and the UTF conversions each one supports]

And yes, you would use std::wstring_convert to facilitate the conversion between the various UTFs. Despite its name, it is not limited to just std::wstring; it actually operates with any std::basic_string type (which std::string, std::wstring, and std::uXXstring are all based on).

Class template std::wstring_convert performs conversions between byte string std::string and wide string std::basic_string<Elem>, using an individual code conversion facet Codecvt. std::wstring_convert assumes ownership of the conversion facet, and cannot use a facet managed by a locale. The standard facets suitable for use with std::wstring_convert are std::codecvt_utf8 for UTF-8/UCS2 and UTF-8/UCS4 conversions and std::codecvt_utf8_utf16 for UTF-8/UTF-16 conversions.

For example:

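#include <codecvt>  // std::codecvt_utf8, std::codecvt_utf16, std::codecvt_utf8_utf16
#include <locale>   // std::wstring_convert
#include <string>

// UTF-8 text is carried in a plain std::string.
typedef std::string u8string;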

u8string To_UTF8(const std::u16string &s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(s);
}

u8string To_UTF8(const std::u32string &s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(s);
}

std::u16string To_UTF16(const u8string &s)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(s);
}

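std::u16string To_UTF16(const std::u32string &s)
{
    // codecvt_utf16 emits big-endian bytes by default; std::little_endian is requested
    // here so the bytes can be reinterpreted as char16_t (assumes a little-endian host).
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> conv;
    std::string bytes = conv.to_bytes(s);
    return std::u16string(reinterpret_cast<const char16_t*>(bytes.c_str()), bytes.length()/sizeof(char16_t));
}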

std::u32string To_UTF32(const u8string &s)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(s);
}

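std::u32string To_UTF32(const std::u16string &s)
{
    const char16_t *pData = s.c_str();
    // codecvt_utf16 reads big-endian bytes by default; std::little_endian matches the
    // in-memory layout of char16_t only on a little-endian host (assumed here).
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t> conv;
    return conv.from_bytes(reinterpret_cast<const char*>(pData), reinterpret_cast<const char*>(pData+s.length()));
}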
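
On the error handling raised in the question: to_bytes() and from_bytes() throw std::range_error when a conversion fails, unless the converter was constructed with fallback error strings, in which case the fallback is returned instead. A minimal sketch, assuming the hypothetical helper names try_utf8_to_utf32 and utf8_to_utf32_or (neither is part of the standard library), and keeping in mind that <codecvt> and std::wstring_convert were deprecated in C++17:

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: reports malformed UTF-8 through the return value.
// On failure, `out` is left unchanged.
bool try_utf8_to_utf32(const std::string &utf8, std::u32string &out)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    try {
        out = conv.from_bytes(utf8);
        return true;
    } catch (const std::range_error &) {
        // Thrown because no error strings were passed to the constructor.
        return false;
    }
}

// Hypothetical helper: returns `fallback` instead of throwing on bad input.
std::u32string utf8_to_utf32_or(const std::string &utf8, const std::u32string &fallback)
{
    // The two-argument constructor takes (byte_err, wide_err); from_bytes()
    // returns wide_err when the conversion fails.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv(std::string(), fallback);
    return conv.from_bytes(utf8);
}
```

Which style fits better probably depends on whether the calling code already uses exceptions; the fallback-string form keeps the call sites free of try blocks.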
Remy Lebeau
  • Yes, I saw and mentioned those types. But am I meant to wrap those for std::basic_string types with error and buffer handling myself (which seems to give little direct value over a simple UTF-8 & UTF-16 encode/decode code point function)? Like I said, wstring_convert seemed "higher level", but I am not seeing how to template it for all the applicable cases. – Fire Lancer Jul 08 '15 at 20:10
  • So I looked at it some more, I still do not see how a u16string/UTF-16 and u32string/UTF-32 conversion is meant to work? None of the template instantiations for wstring_convert or codecvt_utf16 seem to take both, but rather always want std::string? – Fire Lancer Jul 08 '15 at 20:25
  • I have added examples to my answer. – Remy Lebeau Jul 08 '15 at 20:35
  • For from_bytes it looks like I can just use the "const char* first, const char* last" overload (data() and data()+length() cast to const char*), but is nothing possible with to_bytes (To_UTF16)? I guess the copy is not that expensive, but it really feels unneeded. – Fire Lancer Jul 08 '15 at 20:46
  • There is no other option for `std::wstring_convert::to_bytes()`. Maybe you can do something with `std::wbuffer_convert` to avoid unwanted copies. Otherwise, just implement the UTF32->UTF16 conversion manually; it would only take a few extra lines of code. – Remy Lebeau Jul 08 '15 at 22:03
  • ok, well still a nice improvement over what I had in my old C++03 project although not immediately obvious without some helper wrappers along with the new string types and literals. – Fire Lancer Jul 08 '15 at 22:17
  • Can you explain why you used `reinterpret_cast` instead of `static_cast`? – kevinarpe Sep 29 '16 at 08:05
  • @kevinarpe because `static_cast` cannot cast between pointers of *unrelated* types, but `reinterpret_cast` can. – Remy Lebeau Sep 29 '16 at 08:14
  • Does this code compile in C++20 or C++23? – Nguyen Manh Oct 03 '22 at 04:55
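
The manual UTF-32 to UTF-16 conversion suggested in the comments above could be sketched roughly as follows. The function names utf32_to_utf16 and utf16_to_utf32 are hypothetical, and throwing std::range_error on invalid input is an assumption chosen to mirror what std::wstring_convert does; this version avoids both the intermediate byte string and any dependence on host byte order.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-16 code units.
std::u16string utf32_to_utf16(const std::u32string &s)
{
    std::u16string out;
    out.reserve(s.size());
    for (char32_t cp : s) {
        // Reject values outside Unicode and lone surrogate code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::range_error("invalid Unicode code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Split the 20-bit offset into a high and a low surrogate.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-16 code units into UTF-32 code points.
std::u32string utf16_to_utf32(const std::u16string &s)
{
    std::u32string out;
    out.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        char32_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                throw std::range_error("unpaired high surrogate");
            char32_t lo = s[++i];
            u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            throw std::range_error("unpaired low surrogate");
        }
        out.push_back(u);
    }
    return out;
}
```

Because these helpers use only <string>, they also sidestep the C++17 deprecation of <codecvt> and std::wstring_convert.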