35

C++20 added char8_t and std::u8string for UTF-8. However, there is no UTF-8 version of std::cout, and most OS APIs expect char strings in the execution character set (ECS). So we still need a way to convert between UTF-8 and the execution character set.

I was rereading a char8_t paper, and it looks like the only way to convert between UTF-8 and the ECS is to use the std::c8rtomb and std::mbrtoc8 functions. However, their API is extremely confusing. Can someone provide example code?

  • 1
    "_However, there is no UTF-8 version of std::cout_" – [`std::wcout`](https://stackoverflow.com/questions/32338496/what-is-the-difference-between-stdcout-and-stdwcout)? – TrebledJ Apr 07 '19 at 06:23
  • 9
    @TrebledJ This one uses wide execution character set. –  Apr 07 '19 at 06:24
  • Are you sure that execution character set is not utf8 and that you really need to convert? Which OS is this? – user7860670 Apr 07 '19 at 06:40
  • 3
    On Mac and Linux you can generally just print UTF-8 directly (I think there are some rare Linux distributions where this isn't the case); on Windows you should convert to `wchar_t` and use `wcout`. – Alan Birtles Apr 07 '19 at 06:44
  • 7
    I want to stay cross-platform so I can't really assume what ECS is. My main platform is Linux so in practice I myself would have lossless conversion but I still want to stay within cross-platform boundaries. –  Apr 07 '19 at 06:52
  • @Lyberta writing cross-platform code doesn't always mean using the same code on all platforms. Sometimes you need to `#ifdef` things. And Windows is notoriously different than most other platforms when it comes to Unicode handling in C/C++ strings. That is partly why `char(8|16|32)_t` were added in the first place. It is not uncommon to convert strings to UTF-16 on Windows and UTF-8 on other platforms when interfacing with OS components, like the console. – Remy Lebeau Apr 09 '19 at 05:53
  • @RemyLebeau I agree that Windows is its own special snowflake but I've heard that support for UTF-8 was improved in Windows 10. I don't have a Windows machine to test but I expect if I ever need to code something for Windows, by that time it will be possible to set ECS to UTF-8 and provide lossless conversion. –  Apr 09 '19 at 12:43
  • 6
    @Lyberta "I've heard that support for UTF-8 was improved in Windows 10*" - somewhat. Microsoft now allows UTF-8 to be used as the user's ANSI locale (but that feature is currently still in beta), such that you can now use UTF-8 strings in ANSI APIs, for instance. But the OS and Unicode APIs are still based on UTF-16 and that is what you should stick with for best performance. – Remy Lebeau Apr 09 '19 at 15:21
  • AFAIK, as of today (2020 Mar 17) `std::c8rtomb` and `std::mbrtoc8` are **not** yet in any of the three compilers. This is what `<cuchar>` should contain. – Chef Gladiator Mar 17 '20 at 10:26

7 Answers

34

UTF-8 "support" in C++20 seems to be a bad joke.

The only UTF functionality in the Standard Library is support for strings and string_views (std::u8string, std::u8string_view, std::u16string, ...). That is all. There is no Standard Library support for UTF coding in regular expressions, formatting, file i/o and so on.

In C++17 you can--at least--easily treat any UTF-8 data as 'char' data, which makes usage of std::regex, std::fstream, std::cout, etc. possible without loss of performance.

In C++20 things will change. You can no longer write, for example, std::string text = u8"..."; It will also be impossible to write something like

std::u8fstream file; std::u8string line; ... file << line;

since there is no std::u8fstream.

Even the new C++20 std::format does not support UTF at all, because all necessary overloads are simply missing. You cannot write

std::u8string text = std::format(u8"...{}...", 42);

To make matters worse, there is no simple cast (or conversion) between std::string and std::u8string (or even between const char* and const char8_t*). So if you want to format (using std::format) or input/output (std::cin, std::cout, std::fstream, ...) UTF-8 data, you have to internally copy all strings. That is an unnecessary performance killer.

Finally, what use will UTF have without input, output, and formatting?

Galik
CAF
  • 1
    `printf( "%s", (char const *)u8"ひらがな" );` -- this works for printing. For now. – Chef Gladiator Nov 29 '19 at 12:25
  • 2
    @ChefGladiator does it? How do you know the program at the other end of the printf actually decodes your stream as utf8? – spectras Mar 16 '20 at 09:39
  • 1
    @spectras my point exactly. To be explicit: I am perfectly aware prog, lang. does not work on assumptions. Or hacks. Or undocumented compiler behavior. – Chef Gladiator Mar 17 '20 at 10:14
  • Also, in case it has escaped attention, I am perfectly in tune with this answer; I have upvoted it. Here is [my post](https://dbj.org/c-char8_t-is-broken/). Basically, in C++20 we have a UTF-8 keyword (type) `char8_t` which cannot hold every UTF-8 glyph. It can hold only code points in the range `0 .. 7F`. – Chef Gladiator Mar 17 '20 at 10:18
  • 2
    @ChefGladiator `charN_t` types are horribly named because they have nothing to do with "characters". Mapping Unicode Scalar Values to glyphs is also very non-trivial. I'm working on a [proposal with proper names and strong types](https://github.com/Lyberta/cpp-unicode). –  Mar 18 '20 at 02:49
  • The magic of UTF8 is that it is ANSI backward compatible unless you do linguistic stuff like needing word breaking and space separators but this soon becomes very complex and target for ICU anyway. just cast u8 to const char* in format and it all goes well. I know it sounds bad (and it feels bad) but it's good enough to not delay std::format and others even more. – Lothar Aug 26 '20 at 09:00
  • 1
    What a shame utf8 cannot be integrated directly into std::string. That's one of the many reasons I have a neat preference for Qt over the standard library (and the STL). – Kiruahxh Oct 27 '20 at 20:27
  • damn, i'm using c++23 and i'm stuck on this. – Nguyen Manh Oct 03 '22 at 04:42
  • It is true that there is much work needed in the C and C++ standard libraries to improve support for UTF-8. Help is welcome! [SG16](https://github.com/sg16-unicode/sg16) is an open group and invites participation! – Tom Honermann Nov 19 '22 at 19:09
6

At present, std::c8rtomb and std::mbrtoc8 are the only interfaces provided by the standard that enable conversion between the execution encoding and UTF-8. The interfaces are awkward. They were designed to match pre-existing interfaces like std::c16rtomb and std::mbrtoc16. The wording added to the C++ standard for these new interfaces intentionally matches the wording in the C standard for the pre-existing related functions (hopefully these new functions will eventually be added to C; I still need to pursue that). The intent in matching the C standard wording, as confusing as it is, is to ensure that anyone familiar with the C wording recognizes that the char8_t interfaces work the same way.

cppreference.com has some examples for the UTF-16 versions of these functions that should be useful for understanding the char8_t variants.
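As a rough illustration of the calling convention, here is a sketch built on the UTF-16 functions. `MultibyteToUtf16` and `Utf16ToMultibyte` are hypothetical names, not standard functions. Both depend on the current C locale, and the first one stops at the first null character because it relies on the terminator to flush any pending code unit. The char8_t variants are intended to follow the same return-value protocol.

```cpp
#include <climits>   // MB_LEN_MAX
#include <cstddef>
#include <cuchar>    // std::mbrtoc16, std::c16rtomb
#include <stdexcept>
#include <string>

// Hypothetical helper: current locale's multibyte encoding -> UTF-16.
std::u16string MultibyteToUtf16(const std::string& input)
{
    std::u16string output;
    std::mbstate_t state{};                    // initial conversion state
    const char* ptr = input.c_str();
    const char* end = ptr + input.size() + 1;  // include the terminator
    for (;;)
    {
        char16_t unit;
        const std::size_t rc = std::mbrtoc16(
            &unit, ptr, static_cast<std::size_t>(end - ptr), &state);
        if (rc == 0)
            break;                             // the null character was converted
        if (rc == static_cast<std::size_t>(-1))
            throw std::invalid_argument("invalid multibyte sequence");
        if (rc == static_cast<std::size_t>(-2))
            throw std::invalid_argument("incomplete multibyte sequence");
        output.push_back(unit);
        // -3: a pending code unit was stored and no input was consumed.
        if (rc != static_cast<std::size_t>(-3))
            ptr += rc;                         // rc bytes were consumed
    }
    return output;
}

// Hypothetical helper: UTF-16 -> current locale's multibyte encoding.
std::string Utf16ToMultibyte(const std::u16string& input)
{
    std::string output;
    std::mbstate_t state{};
    char buffer[MB_LEN_MAX];
    for (const char16_t unit : input)
    {
        const std::size_t rc = std::c16rtomb(buffer, unit, &state);
        if (rc == static_cast<std::size_t>(-1))
            throw std::invalid_argument("not representable in this locale");
        output.append(buffer, rc);  // rc is 0 after a leading surrogate
    }
    return output;
}
```

The important detail is that the return value, not the output buffer, says what happened: bytes consumed, bytes written, "nothing new consumed" (-3), or an error.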

Tom Honermann
  • In the meantime cppreference has added the full spec of [`<cuchar>`](https://en.cppreference.com/w/cpp/header/cuchar), as every C++20 std lib should contain. – Chef Gladiator Mar 17 '20 at 10:29
  • 1
    The `c8rtomb` and `mbrtoc8` interfaces have also been added to C23 via the adoption of [N2653](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm) as indicated in the minutes for the January/February WG14 virtual meeting recorded in [N2941](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2941.pdf). – Tom Honermann Nov 19 '22 at 18:47
  • Ah, Tom is back. 2 or 3 years later, but he is indeed back. Welcome, Tom. – Chef Gladiator Nov 20 '22 at 21:58
4

The common answer given by C++ authorities at the yearly CppCon conference (for example in 2018 and 2019) was that you should pick your own UTF-8 library for this. There are all kinds of flavours; just pick the one you like. There is still embarrassingly little understanding of and support for Unicode on the C++ side.

Some people hope there will be something in C++23 but we don't even have an official working group so far.

Lothar
  • 1
    We “don’t even have an official working group” as in [SG16](https://isocpp.org/std/the-committee) that has existed for more than a year? – Davis Herring Nov 21 '19 at 01:35
  • 4
    What a joke. Why add it in the standard then in the first place if we only get a rudimentary incomplete draft which breaks existing code? – Sebastian Hoffmann Jul 07 '20 at 11:57
  • Now, with C++23 just 4 months away, I can promise you that there is no improvement coming. Hoping for C++26 or C++29 ... or never ever. – Lothar Aug 14 '22 at 19:12
  • Apparently, as per @Tom Honermann's comment from yesterday, it seems (and that is my interpretation) that C++23 will have UTF-8 fully implemented. PS: standard C23 will have it too. – Chef Gladiator Nov 20 '22 at 22:20
4

AFAIK C++ doesn't yet provide facilities for such conversion. However, I would recommend against using std::u8string in the first place because it is poorly supported in the standard and not supported by any system APIs at all (and will likely never be because of compatibility reasons). On most platforms normal char strings are already UTF-8 and on Windows with MSVC you can compile with /utf-8 which will give you portable Unicode support on major operating systems.
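If you go this route, one way to catch a misconfigured build early is a compile-time probe like the sketch below. `OrdinaryLiteralsAreUtf8` is a hypothetical name, and the check only covers how the compiler encodes ordinary string literals; it says nothing about the runtime locale or the console.

```cpp
// Compile-time probe: does this compiler encode ordinary (char) string
// literals as UTF-8? U+00E9 is encoded in UTF-8 as the bytes 0xC3 0xA9.
constexpr bool OrdinaryLiteralsAreUtf8()
{
    constexpr char probe[] = "\u00E9";
    // Two UTF-8 bytes plus the terminating null give a size of 3.
    return sizeof(probe) == 3
        && static_cast<unsigned char>(probe[0]) == 0xC3
        && static_cast<unsigned char>(probe[1]) == 0xA9;
}
```

A project could then place `static_assert(OrdinaryLiteralsAreUtf8(), "compile with /utf-8 or an equivalent option");` in a common header.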

bames53
vitaut
3

Update 2022 NOV 20

If anybody is still here, wrangling with UTF-8 and C++: apparently, after years of waiting, we have the first leaves of a UTF-8 spring (in November 2022). Please see Tom's comment below.

I assume that in a few short months' time all "3" will have UTF-8 implemented. With a bit of luck, C++23 will finally contain a full implementation of UTF-8, as advertised way back with C++11.

I just checked: cl.exe on this machine (19.34.31933) does not have a fully implemented standard <cuchar>.

Update 2022 APR 19

// with warnings but prints ok on LINUX
// g++ prog.cc  -Wall -Wextra -std=c++2a
//
// clang++ prog.cc -Wall -Wextra -std=c++2a
// lot of warnings but prints OK on LINUX
//
#include <cassert>
#include <clocale>
#include <cstdio>
#include <cstdlib>  // MB_CUR_MAX
#include <cuchar>

#undef P_
#undef P

#define P_(F_, X_) printf("\n%4d : %32s => " F_, __LINE__, #X_, (X_))
#define P(F_, X_) P_(F_, X_)

/*
  using mbstate_t = ... see description ...
  using size_t = ... see description ...

  in the standard but not implemented yet by any of the three

  size_t mbrtoc8(char8_t* pc8, const char* s, size_t n, mbstate_t* ps);
  size_t c8rtomb(char* s, char8_t c8, mbstate_t* ps);
*/

namespace {
constexpr inline auto bad_size = ((size_t)-1);

// https://en.wikipedia.org/wiki/UTF-8
// a compile time constant not intrinsic function
constexpr inline int UTF8_CHAR_MAX_BYTES = 4;

#ifdef STANDARD_CUCHAR_IMPLEMENTED
template <size_t N>
 auto char_star(const char8_t (&in)[N]) noexcept {
  mbstate_t state{};  // must start zero-initialized (initial conversion state)
  constexpr static int out_size = (UTF8_CHAR_MAX_BYTES * N) + 1;

  struct {
    char data[out_size];
  } out = {{0}};
  char* one_char = out.data;
  for (size_t rc, n = 0; n < N; ++n) {
    rc = c8rtomb(one_char, in[n], &state);
    if (rc == bad_size) break;
    one_char += rc;
  }
  return out;
}
#endif //  STANDARD_CUCHAR_IMPLEMENTED

template <size_t N>
auto char_star(const char16_t (&in)[N]) noexcept {
  mbstate_t state{};  // must start zero-initialized (initial conversion state)
  constexpr static int out_size = (UTF8_CHAR_MAX_BYTES * N) + 1;

  struct final {
    char data[out_size];
  } out = {{0}};
  char* one_char = out.data;
  for (size_t rc, n = 0; n < N; ++n) {
    rc = c16rtomb(one_char, in[n], &state);
    if (rc == bad_size) break;
    one_char += rc;
  }
  return out;
}

template <size_t N>
auto char_star(const char32_t (&in)[N]) noexcept {
  mbstate_t state{};  // must start zero-initialized (initial conversion state)
  constexpr static int out_size = (UTF8_CHAR_MAX_BYTES * N) + 1;

  struct final {
    char data[out_size];
  } out = {{0}};
  char* one_char = out.data;
  for (size_t rc, n = 0; n < N; ++n) {
    rc = c32rtomb(one_char, in[n], &state);
    if (rc == bad_size) break;
    one_char += rc;
  }
  return out;
}

}  // namespace
#define KATAKANA "片仮名"
#define KATAKANA8 u8"片仮名"
#define KATAKANA16 u"片仮名"
#define KATAKANA32 U"片仮名"

int main(void) {
  P("%s", KATAKANA);  // const char *
  // lot of warnings but ok output
  P("%s", KATAKANA8);  // const char8_t *

  /*
  garbled or no output
  P( "%s",  KATAKANA16 ); // const char16_t *
  P( "%s" , KATAKANA32 ); // const char32_t *
  */

  setlocale(LC_ALL, "en_US.utf8");

  // no can do as there is no standard <cuchar> yet
  // P( "%s", char_star(KATAKANA8).data );  // const char8_t *
  P("%s", char_star(KATAKANA16).data);  // const char16_t *
  P("%s", char_star(KATAKANA32).data);  // const char32_t *
}

Update 2021 MAR 19

Few things have (not) happened. __STDC_UTF_8__ is no more and <cuchar> is still not implemented by any of "the Three".

Probably much better code matching this thread is HERE.

Update 2020 MAR 17

std::c8rtomb and std::mbrtoc8 are not yet provided.

2019 NOV

std::c8rtomb and std::mbrtoc8 are not yet provided, by the future C++20 ready compilers made by "The 3", to enable the conversion between the execution encoding and UTF-8. They are described in the C++20 standard.

It might be subjective, but c8rtomb() is not an "awkward" interface, to me.

WANDBOX

//  g++ prog.cc -std=gnu++2a
//  clang++ prog.cc -std=c++2a
#include <stdio.h>
#include <clocale>
#ifndef __clang__
#include <cuchar>
#else
// clang has no <cuchar>
#include <uchar.h>
#endif
#include <climits>

template<size_t N>
void  u32sample( const char32_t (&str32)[N] )
{
    #ifndef __clang__
    std::mbstate_t state{};
    #else
    mbstate_t state{};
    #endif
    
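    char out[MB_LEN_MAX]{};
    for(char32_t const & c : str32)
    {
    #ifndef __clang__
        size_t rc = std::c32rtomb(out, c, &state);
    #else
        size_t rc = ::c32rtomb(out, c, &state);
    #endif
        if (rc == (size_t)-1) break;  // not representable in the current locale
        // out is not null-terminated: print only the rc bytes just written
        printf("%.*s", (int)rc, out);
    }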
}

#ifdef __STDC_UTF_8__
template<size_t N>
void  u8sample( const char8_t (& str8)[N])
{
    std::mbstate_t state{};
    
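    char out[MB_LEN_MAX]{};
    for(char8_t const & c : str8)
    {
        // rc is 0 until the last code unit of a character has been supplied
        size_t rc = std::c8rtomb(out, c, &state);
        if (rc == (size_t)-1) break;  // ill-formed input or not representable
        // out is not null-terminated: print only the rc bytes just written
        printf("%.*s", (int)rc, out);
    }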
}
#endif // __STDC_UTF_8__
int main () {
    std::setlocale(LC_ALL, "en_US.utf8");

    #ifdef __linux__
    printf("\nLinux like OS, ") ;
    #endif

    printf(" Compiler %s\n", __VERSION__   ) ;
    
   printf("\nchar32_t *, Converting to 'char *', and then printing --> " ) ;
   u32sample( U"ひらがな" ) ;
    
  #ifdef __STDC_UTF_8__
   printf("\nchar8_t *, Converting to 'char *', and then printing --> " ) ;
   u8sample( u8"ひらがな" ) ;
  #else
   printf("\n\n__STDC_UTF_8__ is not defined, can not use char8_t");
  #endif
   
   printf("\n\nDone ..." ) ;
    
    return 42;
}

I have commented out, and documented, the lines which do not compile as of today.

Chef Gladiator
  • 1
    The `c8rtomb` and `mbrtoc8` functions are available in glibc starting with the [2022-08-02 2.36 release](https://sourceware.org/pipermail/libc-alpha/2022-August/141193.html). `std::c8rtomb` and `std::mbrtoc8` are exposed by libstdc++ starting with gcc 12.1 (but requires that libstdc++ is built against glibc 2.36 or newer) and will be exposed by libc++ starting with Clang 16 (when the C library implementations are available in the global namespace). I don't have information on other C or C++ standard library implementations. – Tom Honermann Nov 19 '22 at 19:01
  • A controversial subject; in the last few days I got +10 and -2 on this one (laugh) – Chef Gladiator May 13 '23 at 10:06
0

VS 2019

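  std::ostream& operator<<(std::ostream& os, const std::u8string& str)
    {
        // Pass the UTF-8 bytes through unchanged; output stops at the first
        // null character because the data is treated as a C string.
        os << reinterpret_cast<const char*>(str.data());
        return os;
    }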

To set console to UTF-8 use https://github.com/MicrosoftDocs/cpp-docs/issues/1915#issuecomment-589644386

YShmidt
-1

Here's code that should conform to C++20. Since no compiler currently (March 2020) implements the conversion functions defined in the paper, I decided not to constrain myself to what is currently implemented and to use the full C++20 spec. So instead of taking std::basic_string or std::basic_string_view, I take ranges of code units. The return value is less general, but it is trivial to change the functions to take an output range instead. This is left as an exercise for the reader.

/// \brief Converts the range of UTF-8 code units to execution encoding.
/// \tparam R Type of the input range.
/// \param[in] input Input range.
/// \return std::string in the execution encoding.
/// \throw std::invalid_argument If input sequence is ill-formed.
/// \note This function depends on the global locale.
template <std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, char8_t>
std::string ToECSString(R&& input)
{
    std::string output;
    char temp_buffer[MB_LEN_MAX];  // MB_CUR_MAX is not a constant expression
    std::mbstate_t mbstate{};
    auto i = std::ranges::begin(input);
    auto end = std::ranges::end(input);
    for (; i != end; ++i)
    {
        std::size_t result = std::c8rtomb(temp_buffer, *i, &mbstate);
        if (result == std::size_t(-1))
        {
            throw std::invalid_argument{"Ill-formed UTF-8 sequence."};
        }
        output.append(temp_buffer, temp_buffer + result);
    }
    return output;
}

/// \brief Converts the input range of code units in execution encoding to
/// UTF-8.
/// \tparam R Type of the input range.
/// \param[in] input Input range.
/// \return std::u8string containing UTF-8 code units.
/// \throw std::invalid_argument If input sequence is ill-formed or does not end
/// at the scalar value boundary.
/// \note This function depends on the global C locale.
template <std::ranges::input_range R>
requires std::same_as<std::ranges::range_value_t<R>, char>
std::u8string ToUTF8String(R&& input)
{
    std::u8string output;
    char8_t temp_buffer;
    std::mbstate_t mbstate{};
    std::size_t result = 0;  // read after the loop, even for empty input
    auto i = std::ranges::begin(input);
    auto end = std::ranges::end(input);
    while (i != end)
    {
        // An input iterator need not be a pointer, so copy the byte first.
        const char byte = *i;
        result = std::mbrtoc8(&temp_buffer, &byte, 1, &mbstate);
        switch (result)
        {
            case 0:
            {
                ++i;
                break;
            }
            case std::size_t(-3):
            {
                break;
            }
            case std::size_t(-2):
            {
                ++i;
                break;
            }
            case std::size_t(-1):
            {
                throw std::invalid_argument{"Invalid input sequence."};
            }
            default:
            {
                std::ranges::advance(i, result);
                break;
            }
        }
        if (result != std::size_t(-2))
        {
            output.append(1, temp_buffer);
        }
    }
    if (result == std::size_t(-2))
    {
            throw std::invalid_argument{
                "Code unit sequence does not end at the scalar value "
                "boundary."};
    }
    return output;
}

/// \brief Converts the contiguous range of code units in execution encoding to
/// UTF-8.
/// \tparam R Type of the contiguous range.
/// \param[in] input Input range.
/// \return std::u8string containing UTF-8 code units.
/// \throw std::invalid_argument If input sequence is ill-formed or does not end
/// at the scalar value boundary.
/// \note This function depends on the global C locale.
template <std::ranges::contiguous_range R>
requires std::same_as<std::ranges::range_value_t<R>, char>
std::u8string ToUTF8String(R&& input)
{
    std::u8string output;
    char8_t temp_buffer;
    std::mbstate_t mbstate{};
    std::size_t offset = 0;
    std::size_t size = std::ranges::size(input);
    while (offset != size)
    {
        std::size_t result = std::mbrtoc8(&temp_buffer,
            std::ranges::data(input) + offset, size - offset, &mbstate);
        switch (result)
        {
            case 0:
            {
                ++offset;
                break;
            }
            case std::size_t(-3):
            {
                break;
            }
            case std::size_t(-2):
            {
                throw std::invalid_argument{
                    "Input sequence does not end at the scalar value "
                    "boundary."};
            }
            case std::size_t(-1):
            {
                throw std::invalid_argument{"Invalid input sequence."};
            }
            default:
            {
                offset += result;
                break;
            }
        }
        output.append(1, temp_buffer);
    }
    return output;
}
  • In the first `ToUTF8String` overload, I think your handling of the `-2` case is not quite right. In that case, no value is stored in `temp_buffer`, so the unconditional append to `output` after the switch statement is incorrect. Otherwise, this code looks pretty nice to me! – Tom Honermann Mar 20 '20 at 02:51