
While debugging with gcc, I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135". Which makes sense: 万 is 0x4E07, and 0x4E in octal is 116.

Now, on Apple LLVM 9.1.0 on an Intel-powered MacBook, I find that the same literal is not treated as the same string, i.e.:

u16string{u"万不得已"} == u16string{u"\007\116\015\116\227\137\362\135"}

evaluates to true under gcc but false under Apple LLVM. I'm still on a little-endian system, so I don't understand what's happening.

NB. I'm not trying to use the correspondence u"万不得已" == u"\007\116\015\116\227\137\362\135". I just want to understand what's happening.
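
For reference, here is a minimal, self-contained version of the test I'm running (assuming the source file is saved as UTF-8 and compiled as C++11 or later):

    #include <iostream>
    #include <string>

    int main() {
        std::u16string a{u"万不得已"};
        std::u16string b{u"\007\116\015\116\227\137\362\135"};
        // Prints true under gcc 7.3.0 on Cygwin, false under Apple LLVM 9.1.0.
        std::cout << std::boolalpha << (a == b) << '\n';
    }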

Mohan

1 Answer


I found that the Unicode literal u"万不得已" was represented as u"\007\116\015\116\227\137\362\135"

No, actually it is not. And here's why...

u"..." string literals are encoded as a char16_t-based UTF-16 encoded string on all platforms (that is what the u prefix is specifically meant for).

u"万不得已" is represented by this UTF-16 codeunit sequence:

4E07 4E0D 5F97 5DF2

On a little-endian system, that UTF-16 sequence is represented by this raw byte sequence:

07 4E 0D 4E 97 5F F2 5D

In octal, that would be represented by "\007\116\015\116\227\137\362\135" ONLY WHEN using a char-based string (note the lack of a string prefix, or u8 would also work for this example).
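
That correspondence is easy to verify by looking at the object representation of the char16_t array directly (a small sketch; assumes a little-endian machine and a UTF-8 source file):

    #include <cstdio>
    #include <cstring>

    int main() {
        const char16_t src[] = u"万不得已";   // 4 code units + null terminator
        unsigned char bytes[sizeof(src)];
        std::memcpy(bytes, src, sizeof(src)); // copy the raw storage
        for (std::size_t i = 0; i + 2 < sizeof(src); ++i)  // skip the 2-byte terminator
            std::printf("\\%03o", static_cast<unsigned>(bytes[i]));
        std::printf("\n");  // expected output: \007\116\015\116\227\137\362\135
    }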

u"\007\116\015\116\227\137\362\135" is NOT a char-based string! It is a char16_t-based string, where each octal number represents a separate UTF-16 codeunit. Thus, this string actually represents this UTF-16 codeunit sequence:

0007 004E 000D 004E 0097 005F 00F2 005D

That is why your two u16string objects are not comparing as the same string value. Because they are really not equal.
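
A short sketch that dumps the code units of each literal makes the mismatch visible (the helper name `dump` is just for illustration; nothing assumed beyond <cstdio> and <string>):

    #include <cstdio>
    #include <string>

    static void dump(const std::u16string& s) {
        for (char16_t c : s)
            std::printf("%04X ", static_cast<unsigned>(c));
        std::printf("\n");
    }

    int main() {
        dump(u"万不得已");                          // 4E07 4E0D 5F97 5DF2
        dump(u"\007\116\015\116\227\137\362\135");  // 0007 004E 000D 004E 0097 005F 00F2 005D
    }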

You can see this in action here: Live Demo

Remy Lebeau
  • That makes perfect sense. It means the debugger is wrong. What I don't now understand is why the equivalence held on gcc. (gcc 7.3.0 on Cygwin.) – Mohan Aug 16 '18 at 00:29
  • And NB I just reran the test to confirm gcc behaviour. – Mohan Aug 16 '18 at 00:30
  • 1
    @Mohan that means the debugger is displaying the **raw bytes** of the `u16string` (which makes sense, if the debugger does not support displaying `char16_t` data as normal characters), and doing so in octal (odd, why not hex?), but I would not expect it to include the `u` prefix on such a raw string. But there is no way that `u16string{u"万不得已"} == u16string{u"\007\116\015\116\227\137\362\135"}` should ever be true on any compiler. If it is, that is a bug that should be reported to the compiler vendor. – Remy Lebeau Aug 16 '18 at 00:32
  • It does include the u prefix. Now that you've pointed it out, it's clear that's a bug. – Mohan Aug 16 '18 at 00:34
  • @Mohan The debugger is likely displaying the `u` prefix because the data is `char16_t`, but then it is displaying the raw 8bit bytes instead of the native 16bit values because `char16_t` can go up to 0xFFFF (decimal 65535) so most values are simply too large to handle in octal, which has a max value of 0x1FF (decimal 511). – Remy Lebeau Aug 16 '18 at 00:38