
I was using string literals with \u escape sequences and passing them to QString::fromUtf8, as in QString::fromUtf8("Precio (\u20AC/k)");, with success. But after puzzling over it a bit and reading cppreference and other sources, it is not clear to me how the \u20AC sequence is translated into binary.

In other words, what is QString::fromUtf8 receiving in place of the \u20AC sequence? The UTF-8 representation of code point U+20AC? Its UTF-16 representation? Or something else?

Most of the sources I've read say it is translated to its UTF-16 representation, which would mean I'm doing something wrong when passing that string to fromUtf8; however, it has always worked fine.

Am I doing the right thing or not?
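
For reference, here is a minimal way to inspect what the compiler actually stores for such a literal (plain C++, no Qt; the byte dump is just my own illustration):

```cpp
#include <cstdio>

int main() {
    // Dump the bytes the compiler stored for the literal.
    // If this prints "e2 82 ac", the execution character set encoded
    // U+20AC as UTF-8, i.e. exactly what QString::fromUtf8 expects.
    const char* s = "\u20AC";
    for (const char* p = s; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
}
```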

ABu

1 Answer


The encoding of an unprefixed string literal "..." is implementation-defined. On many non-Windows compilers it defaults to UTF-8, although it can be changed; for GCC the switch is -fexec-charset (docs).
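
As an illustrative sketch (assuming GCC or Clang; the check itself is my own, not anything the standard prescribes): `\x` escapes specify raw byte values and bypass the execution character set, so you can compare them against what the compiler stored for `\u20AC`:

```cpp
#include <cstdio>
#include <cstring>

int main() {
    const char s[] = "\u20AC";               // bytes depend on the execution charset
    const char utf8_euro[] = "\xE2\x82\xAC"; // raw UTF-8 bytes of U+20AC
    if (std::strcmp(s, utf8_euro) == 0)
        std::puts("execution charset stored U+20AC as UTF-8");
    else
        std::puts("some other execution encoding is in use");
    // With GCC, recompiling with -fexec-charset=ISO-8859-15 changes the
    // result: that charset encodes the euro sign as the single byte 0xA4.
}
```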

To get the UTF-8 encoding of a string literal independently of the execution character set, C++11 introduced u8"..." (cppreference).
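
A minimal sketch of the u8 form (valid as written up to C++17; note that in C++20 the element type of a u8 literal changed from char to char8_t, so this exact snippet would need a cast there):

```cpp
#include <cstdio>

int main() {
    // u8"..." is UTF-8 regardless of -fexec-charset, so this always
    // prints "e2 82 ac".
    const char* s = u8"\u20AC";
    for (; *s != '\0'; ++s)
        std::printf("%02x ", static_cast<unsigned char>(*s));
    std::printf("\n");
}
```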

Cubbi
  • Even if there are no Unicode escape sequences? So a pure-ASCII unprefixed string literal like `"hello"` could be UTF-16 encoded? – ABu Oct 14 '16 at 11:10
  • @Peregring-lk no, UTF-16 is not permitted as the multibyte encoding of the execution character set. It may be used as the wide one: that weird compiler (Visual Studio) where wchar_t is 16 bits might use it as the encoding of `L"hello"` – Cubbi Oct 14 '16 at 11:21
  • Ok, so the encoding of an unprefixed string literal could depend on its contents, couldn't it? If a string contains only characters from the execution character set, it is ASCII-encoded (= UTF-8, = ISO-8859-1, etc.), but if it contains other characters, the encoding of that particular string is implementation-defined. Isn't it? – ABu Oct 14 '16 at 12:49
  • @Peregring-lk no, it doesn't depend on content; it's one command-line option for all string literals. – Cubbi Oct 14 '16 at 12:55
  • Ok, but then I deduce that, although implementation-defined, it must be an ASCII-compatible encoding, mustn't it? – ABu Oct 14 '16 at 13:01
  • @Peregring-lk not sure what you mean by "ASCII-compatible", but the two restrictions are: all 96 characters of the basic charset have single-byte representations, and '0'-'9' appear in that order, so you can do `c - '0'` – Cubbi Oct 14 '16 at 13:12
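
A small, self-contained sketch of that digit guarantee (the helper name is just for illustration):

```cpp
#include <cassert>

// '0'..'9' are required to be contiguous and ascending in the execution
// character set, so subtracting '0' yields a digit's numeric value.
int digit_value(char c) {
    return c - '0';
}

int main() {
    assert(digit_value('0') == 0);
    assert(digit_value('7') == 7);
}
```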