
I was using string literals with \u escape sequences and passing them to QString::fromUtf8, as in QString::fromUtf8("Precio (\u20AC/k)");, with success. But after puzzling over it a bit and reading cppreference and other sources, it is not clear to me how the \u20AC sequence is translated into binary.

In other words, what is QString::fromUtf8 receiving in place of the \u20AC sequence? The UTF-8 representation of code point U+20AC? Its UTF-16 representation? Or something else?

Most of the sources I've read say it is translated to its UTF-16 representation, which would mean I'm doing something wrong when passing that string to fromUtf8; however, it has always worked fine.

Am I doing the right thing or not?
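
For reference, here is a minimal way to inspect what the compiler actually stores for such a literal (plain C++, no Qt; the byte dump is just my own illustration):

```cpp
#include <cstdio>

int main() {
    // Dump the bytes the compiler stored for the literal.
    // If this prints "e2 82 ac", the execution character set encoded
    // U+20AC as UTF-8, i.e. exactly what QString::fromUtf8 expects.
    const char* s = "\u20AC";
    for (const char* p = s; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
}
```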

ABu

1 Answer


The encoding of an unprefixed string literal "..." is implementation-defined. On many non-Windows compilers it defaults to UTF-8, although it can be changed; for GCC the switch is -fexec-charset (docs).
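
As an illustrative sketch (assuming GCC or Clang; the check itself is my own, not anything the standard prescribes): `\x` escapes specify raw byte values and bypass the execution character set, so you can compare them against what the compiler stored for `\u20AC`:

```cpp
#include <cstdio>
#include <cstring>

int main() {
    const char s[] = "\u20AC";               // bytes depend on the execution charset
    const char utf8_euro[] = "\xE2\x82\xAC"; // raw UTF-8 bytes of U+20AC
    if (std::strcmp(s, utf8_euro) == 0)
        std::puts("execution charset stored U+20AC as UTF-8");
    else
        std::puts("some other execution encoding is in use");
    // With GCC, recompiling with -fexec-charset=ISO-8859-15 changes the
    // result: that charset encodes the euro sign as the single byte 0xA4.
}
```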

To get the UTF-8 encoding of a string literal independently of the execution character set, C++11 introduced u8"..." (cppreference).
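
A minimal sketch of the u8 form (valid as written up to C++17; note that in C++20 the element type of a u8 literal changed from char to char8_t, so this exact snippet would need a cast there):

```cpp
#include <cstdio>

int main() {
    // u8"..." is UTF-8 regardless of -fexec-charset, so this always
    // prints "e2 82 ac".
    const char* s = u8"\u20AC";
    for (; *s != '\0'; ++s)
        std::printf("%02x ", static_cast<unsigned char>(*s));
    std::printf("\n");
}
```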

Cubbi
  • Even if there are no Unicode escape sequences? So a pure-ASCII unprefixed string literal like `"hello"` could be UTF-16 encoded? – ABu Oct 14 '16 at 11:10
  • @Peregring-lk no, UTF-16 is not permitted as the multibyte encoding of the execution character set. It may be used as the wide one: that weird compiler (Visual Studio) where wchar_t is 16 bits might use it as the encoding of `L"hello"` – Cubbi Oct 14 '16 at 11:21
  • Ok, so the encoding of an unprefixed string literal could depend on its contents, couldn't it? If a string contains only characters from the execution character set, it is ASCII-encoded (= UTF-8, = ISO-8859-1, etc.), but if it contains other characters, the encoding of that particular string is implementation-defined. Isn't it? – ABu Oct 14 '16 at 12:49
  • @Peregring-lk no, it doesn't depend on content; it's one command-line option for all string literals. – Cubbi Oct 14 '16 at 12:55
  • Ok, but then I deduce that, although implementation-defined, it must be an ASCII-compatible encoding, mustn't it? – ABu Oct 14 '16 at 13:01
  • @Peregring-lk not sure what you mean by "ASCII-compatible", but the two restrictions are: all 96 characters of the basic charset have single-byte representations, and '0'-'9' appear in that order, so you can do `c - '0'` – Cubbi Oct 14 '16 at 13:12
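
A small, self-contained sketch of that digit guarantee (the helper name is just for illustration):

```cpp
#include <cassert>

// '0'..'9' are required to be contiguous and ascending in the execution
// character set, so subtracting '0' yields a digit's numeric value.
int digit_value(char c) {
    return c - '0';
}

int main() {
    assert(digit_value('0') == 0);
    assert(digit_value('7') == 7);
}
```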