
What exactly is the point of `u8` character literals as proposed by N4267?

Their only function seems to be to prevent extended ASCII characters or partial UTF-8 code points from being specified. They still store in a fixed-width 8-bit char (which, as I understand it, is the correct and best way to handle UTF-8 anyway for almost all use cases), so they don't support non-ASCII characters at all. What is going on?

(Actually I'm not entirely sure I understand the need for UTF-8 string literals either. I guess it's the worry of compilers doing weird/ambiguous things with Unicode strings coupled with validation of the Unicode?)

Shafik Yaghmour
Muzer

  • Perhaps [this is helpful](http://stackoverflow.com/a/30872695/1708801) – Shafik Yaghmour Aug 12 '15 at 16:00
  • Ah, thanks a lot for that link, I did find that question but clearly I didn't scroll down enough! That makes sense, so it's basically just to guarantee that a character is ASCII? It's a rather poor name, in that case, I must say! Seems like a feature by coincidence rather than by design... – Muzer Aug 12 '15 at 16:02
  • I frequently do not find the answer I am looking for in the accepted answer and often enough I have to go to the middle of the answers to find it. This can be for many reasons, often good answers come late or perhaps months or years later in some cases. – Shafik Yaghmour Aug 12 '15 at 17:38
  • 2
    I think they should stop doing this kind of things to the language. Someone forward them utf8everywhere.org – Pavel Radzivilovsky Sep 07 '15 at 05:33

1 Answer


The rationale is covered by Evolution Working Group issue 119, *N4197 Adding u8 character literals, [tiny] Why no u8 character literals?*, which tracked the proposal and says:

We have five encoding-prefixes for string-literals (none, L, u8, u, U) but only four for character literals -- the missing one is u8 for character literals.

This matters for implementations where the narrow execution character set is not ASCII. In such a case, u8 character literals would provide an ideal way to write character literals with guaranteed ASCII encoding (the single-code-unit u8 encodings are exactly ASCII), but... we don't provide them. Instead, the best one can do is something like this:

```cpp
char x_ascii = { u'x' };
```

... where we'll get a narrowing error if the codepoint doesn't fit in a 'char'. (Note that this is not quite the same as u8'x', which would give us an error if the codepoint was not representable as a single code unit in UTF-8.)

Shafik Yaghmour

  • Maybe you should also mention that if the u8 literal requires several bytes it is ill-formed (according to your link in the comment). I think that's the biggest benefit, because nowadays writing something like `if (c=='ç')` seems to work but might in fact not give the expected result (either because of a wrong encoding of the source file, or because only the first byte is used), whereas u8'ç' would raise an error showing that there is a real problem here. – Christophe Aug 12 '15 at 16:11
  • @Christophe good point, updated answer to include that part as well. – Shafik Yaghmour Aug 12 '15 at 17:35
  • 4
    Cheers, this is clearly the correct answer. I don't much care for it, as it seems very counter-intuitive to have a prefix that means "UTF-8" that actually accepts only ASCII, but I suppose it's better than the alternatives. Thanks! – Muzer Aug 13 '15 at 12:24