9

I understand that char in C++ is just an integer type that stores ASCII symbols as numbers ranging from 0 to 127. The Scandinavian letters 'æ', 'ø', and 'å' are not among the 128 symbols in the ASCII table.

So naturally, when I try `char ch1 = 'ø'` I get a compiler error. However, `string str = "øæå"` works fine, even though a string is made up of chars, right?

Does string somehow switch over to Unicode?
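
For reference, a minimal sketch of the two cases (assuming the source file is saved as UTF-8):

    #include <string>

    int main() {
        // char ch1 = 'ø';         // rejected: the character doesn't fit in a single char
        std::string str = "øæå";   // accepted: the string simply holds the encoded bytes
    }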

curiousguy
That new guy

4 Answers

8

In C++ there is the source character set and the execution character set. The source character set is what you can use in your source code, but it doesn't have to coincide with the characters available at runtime.

It's implementation-defined what happens if you use characters in your source code that aren't in the source character set. Apparently 'ø' is not in your compiler's source character set, otherwise you wouldn't have gotten an error; this means that your compiler's documentation should include an explanation of what it does for both of these code samples. Probably you will find that str does have some sort of sequence of bytes in it that form a string.

To avoid this you could use escape sequences instead of embedding the characters directly in your source code, in this case `'\xF8'`. If you need characters that aren't in the execution character set either, you can use `wchar_t` and `wstring`.
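
A minimal sketch of both suggestions (assuming `'\xF8'` maps to 'ø' in your execution character set, e.g. Latin-1; the wide string spells the characters as universal character names):

    #include <string>

    int main() {
        char ch1 = '\xF8';                        // escape sequence instead of a literal 'ø'; 0xF8 is ø in Latin-1
        std::wstring ws = L"\u00F8\u00E6\u00E5";  // øæå written as universal character names in a wide string
    }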

M.M
  • "Apparently 'ø' is not in your compiler's source character set, otherwise you wouldn't have gotten an error;" That's not necessarily true. There are some cases that the spec does describe exactly what happens, such as when a character does exist in the execution encoding but is not represented by a single numerical value (e.g., multi-byte encodings). – bames53 Apr 25 '14 at 05:30
  • `wchar_t` character and string literals use the 'wide execution charset', which could be just as limited as the regular execution charset. But really compilers ought to support the various Unicode encodings as execution charsets. – bames53 Apr 25 '14 at 05:38
  • You could also use `u'ø'` or `U'ø'` instead. – Cœur Jun 03 '20 at 13:36
  • @Cœur compilers are not required to support that – M.M Jun 03 '20 at 21:46
7

From the source code `char c = 'ø';` the compiler reports:

source_file.cpp:2:12: error: character too large for enclosing character literal type
  char c = '<U+00F8>';
           ^

What's happening here is that the compiler is converting the character from the source code encoding and determining that there's no representation of that character using the execution encoding that fits inside a single char. (Note that this error has nothing to do with the initialization of c; it would happen with any such character literal.)

When you put such characters into a string literal rather than a character literal, however, the compiler's conversion from the source encoding to the execution encoding is perfectly happy to use multi-byte representations of the characters when the execution encoding is multi-byte, such as UTF-8 is.
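
For example, a sketch of what you might observe when the execution encoding is UTF-8 (the exact bytes depend on your compiler and its flags):

    #include <iostream>
    #include <string>

    int main() {
        std::string s = "ø";                         // two bytes, 0xC3 0xB8, with a UTF-8 execution encoding
        std::cout << s.size() << '\n';               // prints 2 in that case
        for (unsigned char c : s)
            std::cout << std::hex << int(c) << ' ';  // prints c3 b8
        std::cout << '\n';
    }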

To better understand what compilers do in this area you should start by reading clauses 2.3 [lex.charsets], 2.14.3 [lex.ccon], and 2.14.5 [lex.string] in the C++ standard.

bames53
  • So why does this happen on clang but work with gcc? clang allows other multi-character literals like 'aa'. – Lewis Kelsey May 30 '20 at 19:40
  • 1
    Clang reads the single character and treats it as a single c-char in the C++ grammar, meaning that the result must not be a _multicharacter literal_. Gcc is treating the single codepoint, which encodes as multiple bytes in UTF-8, as a sequence of c-chars. There may even be a way to interpret gcc's behavior such that it technically conforms to the spec (though it's not immediately obvious, as a UCN is specified in the spec to be a single _c-char_). I think clang's behavior is better (as well as conforming to the spec). – bames53 Jun 05 '20 at 06:56
4

What's likely happening here is that your source file is encoded as UTF-8 or some other multi-byte character encoding, and the compiler is simply treating it as a sequence of bytes. A single char can only be a single byte, but a string is perfectly happy to be as many bytes as are required.
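
A small sketch of that, assuming the source file and the execution encoding are both UTF-8:

    #include <iostream>
    #include <string>

    int main() {
        std::string str = "øæå";
        // With UTF-8, each of the three letters is encoded as two bytes,
        // so the string holds 6 chars even though it contains only three "characters".
        std::cout << str.size() << '\n';   // likely prints 6
    }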

Mark Ransom
  • If the compiler isn't handling source encoding to execution encoding conversions then a character with a multi-byte source encoding in a character literal will generally produce a multi-char rather than an error. That's what gcc does and why the code `std::cout << std::hex << 'ø';` prints 'c3b8' when built with gcc. – bames53 Apr 25 '14 at 05:35
  • @bames53 a character literal larger than a `char` will generate at least a warning if you try to assign it to a `char` variable. Your example sidesteps that conversion. – Mark Ransom Apr 25 '14 at 06:00
0

ASCII defines only 128 characters. 'ø' is character 248 (0xF8) in extended ASCII (Latin-1), an 8-bit encoding whose first 128 values are the same as 7-bit ASCII, so you can try `char ch1 = '\xF8';`.
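
A quick sketch of that suggestion (it only displays as 'ø' if the terminal actually uses Latin-1 or another encoding that puts ø at 0xF8):

    #include <iostream>

    int main() {
        char ch1 = '\xF8';                                          // byte 0xF8 (248), ø in Latin-1
        std::cout << ch1 << '\n';                                   // shows ø only under a matching terminal encoding
        std::cout << static_cast<unsigned char>(ch1) + 0 << '\n';   // prints 248
    }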

Shang