
Short version:
If I wanted to write a program that can efficiently perform operations on Unicode characters, and that can read and write files in the UTF-8 or UTF-16 encodings, what is the appropriate way to do this in C++?

Long version:
C++ predates Unicode, and both have evolved significantly since. I need to know how to write standards-compliant, leak-free C++ code. I need clear answers to:

  • Which string container should I pick?

    • std::string with UTF-8?
    • std::wstring (don't really know much about it)
    • std::u16string with UTF-16?
    • std::u32string with UTF-32?
  • Should I stick entirely to one of the above containers or change them when needed?

  • Can I use non-English characters in string literals when using UTF strings, such as the Polish characters ąćęłńśźż, etc.?

  • What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?
    What happens when I do the following?

     std::string s = u8"foo";
     s += 'x';
    
  • What are differences between wchar_t and other multi-byte character types? Is wchar_t character or wchar_t string literal capable of storing UTF encodings?

vy32
Poeta Kodu
  • One question per question please. As-is this is way too broad for SO format. – Ron Feb 15 '18 at 21:48
  • 1
    It's an interesting topic though - where should it go instead? – CookiePLMonster Feb 15 '18 at 21:50
  • I know that this is huge topic and I may be wrong, but asking these questions separately would take ages. In my opinion they are linked to each other. I am quite a newbie in SO, forgive me please. – Poeta Kodu Feb 15 '18 at 21:50
  • 3
    @razzorflame I suggest breaking it down into multiple questions. After doing the proper online research, that is. – Ron Feb 15 '18 at 21:51
  • *"If I wanted to write ideally a bug free program"* - Let go of this notion, it's impossible. You are not Jon Skeet. – StoryTeller - Unslander Monica Feb 15 '18 at 21:51
  • @StoryTeller, sorry that should look like this: "... to write, ideally, a bug free program". – Poeta Kodu Feb 15 '18 at 21:53
  • 1
    If a program needs to process unicode characters (like perform regex or replacements, editing etc..) then I usually convert everything to `UTF-32` and work with that internally. Otherwise `UTF-8` for `i/o` and display. Using `GUI` toolkits may force you to work with a particular encoding. Some `UTF-16` others `UTF-8` etc... – Galik Feb 15 '18 at 21:53
  • 2
    This was a useful resource for me when dealing with Unicode in C++: http://utf8everywhere.org/ – Ardavel Feb 15 '18 at 21:59
  • 3
    Take a few secs to think about left-to-right vs right-to-left; what exactly is a space ( https://blogs.msdn.microsoft.com/oldnewthing/20180214-00/?p=98016 ); what sort of text processing are you going to do? The storage problem (any of the UTFs) is not the end of the issue just the beginning. eg your toupper / tolower does not make sense for some languages which will not round-trip (eg German) or even have a conversion. – Richard Critten Feb 15 '18 at 22:00
  • 2
    There's no usable Unicode support in the C++ standard. The problem is solved with third party libraries and frameworks. – n. m. could be an AI Feb 15 '18 at 22:30
  • So why do L"", u8"", u"" and U"" string literals exist? cppreference clearly talks about UTF. – Poeta Kodu Feb 15 '18 at 22:37
  • 2
    @razzorflame: they exist because they are necessary to encode Unicode string literals in the app's memory. But **using** Unicode data at runtime is a huge can of worms, requiring many logistics beyond just in-memory representation. – Remy Lebeau Feb 15 '18 at 22:39

1 Answer


Which string container should I pick?

That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and each has its own advantages and disadvantages. Generally, UTF-8 is good for storage and communication purposes, and is backwards compatible with ASCII, whereas UTF-16/32 is easier to use when processing Unicode data.

std::wstring (don't really know much about it)

The size of wchar_t is compiler- and platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendor, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
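As a quick illustration, the size relationships above can be checked at compile time (note that the exact size of wchar_t below depends on your platform, so it is deliberately not asserted):

```cpp
#include <cstddef>

// sizeof(char)     == 1  -- always, by definition
// sizeof(wchar_t)  == 2 on Windows, typically 4 on Unix-like systems
// sizeof(char16_t) >= 2  -- holds a UTF-16 code unit (C++11)
// sizeof(char32_t) >= 4  -- holds a UTF-32 code unit (C++11)
static_assert(sizeof(char) == 1, "char is always exactly one byte");
static_assert(sizeof(char16_t) >= 2, "char16_t holds a UTF-16 code unit");
static_assert(sizeof(char32_t) >= 4, "char32_t holds a UTF-32 code unit");
```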

Should I stick entirely to one of the above containers or change them when needed?

Use whatever containers suit your needs.

Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.

How to properly convert between them?

There are many ways to handle that.

C++11 introduced std::wstring_convert/std::wbuffer_convert, but these were deprecated in C++17.

There are 3rd party Unicode conversion libraries, such as ICONV, ICU, etc.

There are C library functions, platform system calls, etc.
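To illustrate what such a conversion involves, here is a minimal hand-rolled UTF-8 to UTF-32 decoder sketch. It assumes well-formed input; production code should use a library such as ICU and handle invalid sequences:

```cpp
#include <cstddef>
#include <string>

// Minimal UTF-8 -> UTF-32 decoder sketch (assumes well-formed input).
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ) {
        unsigned char b = in[i];
        char32_t cp;
        int len;
        if      (b < 0x80)        { cp = b;        len = 1; } // ASCII
        else if ((b >> 5) == 0x6) { cp = b & 0x1F; len = 2; } // 110xxxxx
        else if ((b >> 4) == 0xE) { cp = b & 0x0F; len = 3; } // 1110xxxx
        else                      { cp = b & 0x07; len = 4; } // 11110xxx
        // Fold in the 6 payload bits of each continuation byte (10xxxxxx)
        for (int j = 1; j < len; ++j)
            cp = (cp << 6) | (in[i + j] & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```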

Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?

Yes, if you use appropriate string literal prefixes:

u8 for UTF-8.

L for UTF-16 or UTF-32 (depending on compiler/platform).

u for UTF-16.

U for UTF-32.

Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So whatever charset you choose to save your files in, such as UTF-8, make sure you tell your compiler what that charset is, or you may end up with the wrong string values at runtime.
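For example, here is a small sketch of how the prefixes affect what gets stored. The text "zażółć" is spelled with universal-character-names (\uXXXX) so the snippet does not depend on the charset the source file is saved in; the identifier names are just illustrative:

```cpp
#include <cstddef>
#include <string>

// u"" selects UTF-16, U"" selects UTF-32; the plain "" literal below holds
// the same text as raw UTF-8 bytes. (u8"" also exists: it yields char
// elements in C++17 but char8_t elements in C++20.)
const std::u16string zazolc_utf16 = u"za\u017c\u00f3\u0142\u0107"; // 6 code units (all BMP)
const std::u32string zazolc_utf32 = U"za\u017c\u00f3\u0142\u0107"; // 6 code units
const std::string    zazolc_utf8  = "\x7a\x61\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"; // 10 bytes
```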

What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?

Each string character may be a single-byte, or be part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string, and the character being encoded.

Similarly, std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogate pairs to encode.

When a string container holds a UTF encoded string, each "character" is just a UTF code unit. UTF-8 encodes a Unicode codepoint as 1-4 code units (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 code units (1-2 wchar_ts/char16_ts in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 code unit (1 char32_t in a std::u32string).
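A small sketch of those code-unit counts, contrasting a BMP character with a supplementary-plane character (the identifier names are illustrative only):

```cpp
#include <string>

// U+0105 'ą' is in the BMP, so it fits in one UTF-16 code unit.
// U+1F600 (an emoji) is in a supplementary plane, so UTF-16 needs a
// surrogate pair for it, while UTF-32 always uses exactly one code unit.
const std::u16string bmp_char   = u"\u0105";      // 1 code unit
const std::u16string supp_char  = u"\U0001F600";  // 2 code units (surrogate pair)
const std::u32string utf32_char = U"\U0001F600";  // 1 code unit
```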

What happens when I do the following?

std::string s = u8"foo";
s += 'x';

Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.
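A minimal sketch of that behavior (the helper name append_x is just for illustration):

```cpp
#include <string>

// std::string is just a sequence of bytes; operator+=('x') appends one
// byte regardless of what encoding the existing bytes represent.
inline std::string append_x() {
    std::string s = "\xc4\x85"; // the UTF-8 bytes for U+0105 'ą' (2 bytes)
    s += 'x';                   // now 3 bytes: C4 85 78
    return s;
}
```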

How can I distinguish UTF char[] and non-UTF char[] or std::string?

You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see whether it conforms to a UTF encoding or not.
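A minimal heuristic sketch, assuming we only want to check whether the bytes are structurally valid UTF-8. Note that a true result only means the data could be UTF-8, not that it is:

```cpp
#include <cstddef>
#include <string>

// Checks UTF-8 structure: a valid lead byte followed by the right
// number of continuation bytes (10xxxxxx). Does not reject overlong
// encodings or surrogate codepoints; a real validator should.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = s[i];
        std::size_t extra;
        if      (b < 0x80)           extra = 0; // ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else return false;                      // invalid lead byte
        if (i + extra >= s.size()) return false; // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((static_cast<unsigned char>(s[i + j]) & 0xC0) != 0x80)
                return false;                   // continuation byte expected
        i += extra + 1;
    }
    return true;
}
```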

What are differences between wchar_t and other multi-byte character types?

Byte size and UTF encoding.

char = ANSI/MBCS or UTF-8

wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform

char8_t = UTF-8

char16_t = UTF-16

char32_t = UTF-32

Is wchar_t character or wchar_t string literal capable of storing UTF encodings?

Yes, UTF-16 or UTF-32, depending on compiler/platform. In case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.
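For example, a sketch whose result depends on the platform's wchar_t size (the function name is illustrative only):

```cpp
#include <cstddef>
#include <string>

// Length of L"😀" (U+1F600, outside the BMP) in wchar_t code units:
// 2 where wchar_t is a UTF-16 code unit (Windows, surrogate pair),
// 1 where wchar_t is a UTF-32 code unit (most Unix-like platforms).
inline std::size_t emoji_wchar_units() {
    return std::wstring(L"\U0001F600").size();
}
```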

How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?

That is a very broad topic, worthy of its own separate question by itself.

Remy Lebeau
  • Thank you for shedding light on this topic. I understand now how huge this topic really is. I need to ask one or two more extensive questions about that, but as you said, it is worth a separate question. – Poeta Kodu Feb 15 '18 at 22:48
  • Nice write-up. Minor point: you always could use so-called non-English characters in string literals. The compiler encodes them to the specified "execution charset" (encoding) as directed, with some compilers defaulting to UTF-8. This is the case for `'x'` in the question. – Tom Blodget Feb 16 '18 at 00:23
  • @MooingDuck "*It's important to call out that `std::string` is `char` based (aka not multi-byte aware), and though it can _store_ multibyte data, all methods use `char` index and `char` length*" - the same can be said of any of the string types. They store encoded units, not Unicode codepoints. What people generally perceive as characters (ie graphemes) is not what string interfaces are designed for; you usually need higher-level Unicode libraries to deal with those. – Remy Lebeau Dec 05 '20 at 18:27
  • @MooingDuck "*you need to have outside knowledge beyond just the type*" - added – Remy Lebeau Dec 05 '20 at 18:41