17

Why is there no UTF-8 character literal in C11 or C++11 even though there are UTF-8 string literals? I understand that, generally speaking, a character literal represents a single ASCII character, which is identical to a single-octet UTF-8 code point, but neither C nor C++ says the encoding has to be ASCII.

Basically, if I read the standard right, there's no guarantee that '0' will represent the integer 0x30, yet u8"0" must represent the char sequence 0x30 0x00.
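
To illustrate (a minimal sketch; both declarations compile as C11 or C++11):

const char *s = u8"0"; /* s[0] is guaranteed to be 0x30, s[1] to be 0x00 */
char c = '0';          /* implementation-defined; 0x30 only on ASCII-based systems */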

EDIT:

I'm aware not every UTF-8 code point would fit in a char. Such a literal would only be useful for single-octet code points (that is, ASCII), so I guess calling it an "ASCII character literal" would be more fitting, but the question still stands. I just chose to frame the question in terms of UTF-8 because there are UTF-8 string literals. The only way I can imagine portably guaranteeing ASCII values would be to write a constant for each character, which wouldn't be so bad considering there are only 128, but still...

jbatez
  • Since it's a variable-width encoding, what could you store it in? – Pubby Jun 07 '12 at 19:11
  • @Pubby: One could store it as a 32+ bit type, requiring zero padding. – Mooing Duck Jun 07 '12 at 19:25
  • @Pubby or just the int literal itself. But we can guarantee ASCII *strings* with `u8"string"` literals; why isn't there a way to guarantee ASCII character literals? – jbatez Jun 07 '12 at 19:50
  • @JoBates You should ask another question asking "how can I get a guaranteed ASCII string?" Leave off the idea of using utf8 to get there. – Pubby Jun 07 '12 at 19:57
  • @JoBates: "But we can guarantee ASCII strings with u8"string" literals" No, you can guarantee *UTF-8 strings*. It just so happens that UTF-8 is a superset of ASCII. If that weren't the case, then you would have no such "guarantee". Personally, I see no reason for this, when virtually every compiler's native character set is ASCII (or a superset thereof). – Nicol Bolas Jun 07 '12 at 20:07
  • You can get your utf-8 character literal like so: `char c = u8"A"[0];` – bames53 Jun 07 '12 at 20:15
  • @bames53: Unfortunately, that's not a UTF-8 character literal, it's an expression that evaluates to a known character. So you can't use it in a `switch` statement, for example. – Dietrich Epp Jun 07 '12 at 21:18
  • @DietrichEpp It's a constant expression so in C++11 you actually can use it as a case in a switch statement (`case u8"A"[0]:`). – bames53 Jun 07 '12 at 21:34
  • Oh, or even better: `*u8"A"`. This is also a constant expression. – bames53 Jun 07 '12 at 21:41 (see the sketch after these comments)
  • @bames53 And for u8"Я"[0] you'll get the first of the two bytes encoding the Cyrillic symbol =/ – vines Oct 18 '12 at 22:56
  • u8 character literals are now being considered for C++17: https://isocpp.org/files/papers/n4267.html – jbatez Nov 25 '14 at 16:21
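
To illustrate the constant-expression workaround from bames53's comments above (a sketch assuming a C++11 compiler; the value 0x41 follows from the UTF-8 guarantee, nothing else):

#include <cstdio>

constexpr char A = *u8"A";   // a constant expression; 0x41 on every implementation

void classify(char c)
{
    switch (c) {
    case u8"A"[0]:           // also a constant expression, so it can label a case
        std::puts("got an A");
        break;
    default:
        break;
    }
}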

5 Answers

10

It is perfectly acceptable to write non-portable C code, and this is one of many good reasons to do so. Feel free to assume that your system uses ASCII or some superset thereof, and warn your users that they shouldn't try to run your program on an EBCDIC system.

If you are feeling very generous, you can encode a check. The gperf program is known to generate code that includes such a check.

_Static_assert('0' == 48, "must be ASCII-compatible");

Or, for pre-C11 compilers,

extern int must_be_ascii_compatible['0' == 48 ? 1 : -1];

If you are on C11, you can use the u or U prefix on character constants, but not the u8 prefix...

/* This is useless, doesn't do what you want... */
_Static_assert(0, "this code is broken everywhere");
if (c == '々') ...

/* This works as long as wchar_t is UTF-16 or UTF-32 or UCS-2... */
/* Note: you shouldn't be using wchar_t, though... */
_Static_assert(__STDC_ISO_10646__, "wchar_t must be some form of Unicode");
if (c == L'々') ...

/* This works as long as char16_t is UTF-16 or UCS-2... */
_Static_assert(__STDC_UTF_16__, "char16_t must be UTF-16");
if (c == u'々') ...

/* This works as long as char32_t is UTF-32... */
_Static_assert(__STDC_UTF_32__, "char32_t must be UTF-32");
if (c == U'々') ...

There are some projects that are written in very portable C and have been ported to non-ASCII systems (example). This required a non-trivial amount of porting effort, and there's no real reason to make the effort unless you know you want to run your code on EBCDIC systems.

On standards: The people writing the C standard have to contend with every possible C implementation, including some downright bizarre ones. There are known systems where sizeof(char) == sizeof(long), CHAR_BIT != 8, integral types have trap representations, sizeof(void *) != sizeof(int *), sizeof(void *) != sizeof(void (*)()), va_list are heap-allocated, etc. It's a nightmare.

Don't beat yourself up trying to write code that will run on systems you've never even heard of, and don't search too hard for guarantees in the C standard.

For example, as far as the C standard is concerned, the following is a valid implementation of malloc:

void *malloc(size_t size) { return NULL; }

Note that while u8"..." literals are guaranteed to be UTF-8, u"..." and U"..." have no guarantees except that the code units are 16 bits and 32 bits wide, respectively; the actual encoding must be documented by the implementation.

Summary: Safe to assume ASCII compatibility in 2012.

Dietrich Epp
  • Wait, `u"..."` and `U"..."` aren't required to be UTF-16 and UTF-32? I guess `u8"..."` is the weird one then. So, reverse question! Why does `u8"..."` exist? Maybe I'll write that one up later. – jbatez Jun 07 '12 at 20:44
  • @JoBates They're mandated to be arrays of `char16_t` and `char32_t` respectively. The standards are just short of calling them e.g. "UTF-16 encoded strings" whereas they do mention "UTF-8 encoded strings". Keep in mind that the elements of such arrays *are* Unicode code units and that the C++11 Standard provides facilities to convert to and from what it calls "UTF-16 multibyte sequences". I don't know what it takes to be a UTF-16 or UTF-32 encoded string (and perhaps the standards don't know either), but I know what I can do with `U""`. – Luc Danton Jun 07 '12 at 22:45
  • @LucDanton I just noticed this in the C++11 standard (not in C11): _"The value of a char16_t literal containing a single c-char is equal to its ISO 10646 code point value, provided that the code point is representable with a single 16-bit code unit... The value of a char32_t literal containing a single c-char is equal to its ISO 10646 code point value."_ Does that mean I could write something like `char c = u'0'` thus guaranteeing `c == 0x30`? If that's the case, then I guess the logic behind not including an ASCII char literal is the same as not providing explicitly short int literals. – jbatez Jul 02 '12 at 21:17
8

A UTF-8 character literal would have to have variable length: most code points can't be stored in a single char or wchar_t. So what type should it have? As we don't have variable-length types in C or C++, except for arrays of fixed-size types, the only reasonable type for it would be const char * - and C strings are required to be null-terminated, so it wouldn't change anything.
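
To make that concrete (a small sketch; the byte values shown are simply the standard UTF-8 encoding of U+042F):

const char *ya = u8"\u042F"; /* "Я": two code units, 0xD0 0xAF, then the terminator */
/* a single char can hold only one of those octets, so a hypothetical
   u8'Я' of type char would have no sensible value */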

As for the edit:

Quote from the C++11 standard:

The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

(footnote at 2.3.1).

I think that's a good reason for not guaranteeing it. Although, as you noted in the comments here, for most (or all) mainstream compilers, the ASCII-ness of character literals is guaranteed by the implementation rather than by the standard.

Griwes
  • I understand that, but for the ones which do fit, it'd be convenient to guarantee you get the ASCII/UTF-8 encoding even though almost every (every?) compiler does anyway. – jbatez Jun 07 '12 at 19:18
  • How useful is that really? That would only be useful if you're just doing ASCII. – R. Martinho Fernandes Jun 07 '12 at 19:23
  • Wait. What about `wchar_t` and `L'0'`? It *is* exactly 0x30 0x00 on any compiler. – Forgottn Jun 07 '12 at 19:25
  • @Forgottn: Well, it's 0x30 (no 0x00) on most computers, but there are no guarantees. And it's either 16 bits or 32 bits, depending, which is not very useful. – Dietrich Epp Jun 07 '12 at 19:28
  • @rmartinhofernandes Yes, but character literals aren't necessarily guaranteed to translate to ASCII. However, one could use UTF-8 string literals with only ASCII characters to guarantee an ASCII string. – jbatez Jun 07 '12 at 19:36
  • @Griwes wait, do you interpret that quote to mean that character literals **are** guaranteed to map to their ASCII integer values? "intended to" is pretty vague. – jbatez Jun 07 '12 at 19:40
  • @JoBates, no, I interpret it as "they are not required to map to their ASCII integer values"; I didn't write anything implying so. The standard doesn't guarantee it, but most implementations probably provide such an "implementation guarantee", although I'm not perfectly sure about the validity of this point. Plus, I quoted it because it gives a pretty valid (but probably not valid for much longer) reason for this. – Griwes Jun 07 '12 at 19:42
  • Mmm... @Griwes, please tell me which exotic compilers you're going to use? I'd say that at least the top 5 most popular compilers guarantee such a mapping. – Forgottn Jun 07 '12 at 19:44
  • @Forgottn, I'm no expert on compilers, and I'm not the kind of person who likes to say "I'm sure about this" when I'm not. – Griwes Jun 07 '12 at 19:45
  • @Griwes, this looks to me like you won't believe in something if you can't touch it. – Forgottn Jun 07 '12 at 19:48
  • @Forgottn, no, I just don't like being wrong; I could probably go and check it in every mainstream compiler - but what advantage would that give me, besides knowing exactly which compiler guarantees what in this area? – Griwes Jun 07 '12 at 19:50
  • @Forgottn According to the standard, `L'0'` is no more likely to be 0x30 than plain-old `'0'`. Whether or not there exists a compiler that doesn't handle it that way is irrelevant; this question asks about the logic of the standard. – jbatez Jun 07 '12 at 19:55
  • There are `u''` UTF-16 character literals though, so this just begs the question. – Luc Danton Jun 07 '12 at 22:49
  • This was recently added, and as I note in my answer, they overcame the sizing issue by making the literal ill-formed if it does not fit. – Shafik Yaghmour Jun 16 '15 at 16:09
6

For C++ this has been addressed by Evolution Working Group issue 119: Adding u8 character literals, whose Motivation section says:

We have five encoding-prefixes for string-literals (none, L, u8, u, U) but only four for character literals -- the missing one is u8. If the narrow execution character set is not ASCII, u8 character literals would provide a way to write character literals with guaranteed ASCII encoding (the single-code-unit u8 encodings are exactly ASCII). Adding support for these literals would add a useful feature and make the language slightly more consistent.

EWG discussed the idea of adding u8 character literals in Rapperswil and accepted the change. This paper provides wording for that extension.

This was incorporated into the working draft using the wording from N4267: Adding u8 character literals. We can find the wording in the latest draft standard at the time of writing, N4527, where section 2.14.3 says they are limited to code points that fit into a single UTF-8 code unit:

A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is a US-ASCII character). A UTF-8 character literal containing multiple c-chars is ill-formed.

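A quick sketch of what that wording buys you, assuming a C++17 compiler:

char c = u8'0';                  // type char, value 0x30 on every implementation
static_assert(u8'0' == 0x30, "UTF-8 character literals are ASCII-valued");
// char bad = u8'Я';             // ill-formed: needs more than one UTF-8 code unit
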
Shafik Yaghmour
0

If you don't trust that your compiler will treat '0' as ASCII character 0x30, then you could use static_cast<char>(0x30) instead.
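
For example (a sketch; the trade-off is exactly the readability concern raised in the comments below):

if (c == static_cast<char>(0x30)) ...  /* guaranteed value, but opaque */
if (c == '0') ...                      /* readable, but encoding-dependent */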

Edward Loper
  • OP asks for reasoning, not for propositions to implement such guarantees by hand... – Griwes Jun 07 '12 at 19:34
  • @Griwes that's a reasonable point -- how about this for a reason: it's overkill to add a new syntax for something you can already do (using the static_cast I gave above, or just `char(0x30)` if you don't want to type that much). – Edward Loper Jun 07 '12 at 19:37
  • It would add immensely to readability. With that logic, why have character literals at all? – jbatez Jun 07 '12 at 19:43
  • You could probably make a case for the encoding not mattering so long as it's consistent across programs written for the same platform, but our computers are highly networked today. It wouldn't bug me so much if there weren't `u8"string"` literals which guarantee the encoding. But, clearly, since those exist, any standard-compliant compiler already has the logic to map source characters to UTF-8, and thus ASCII, characters. – jbatez Jun 07 '12 at 19:48
0

As you are aware, UTF-8-encoded characters may need several octets, and thus several chars, so the natural type for them is char[], which is indeed the type of a u8-prefixed string literal! So C11 is right on track here; it just sticks to its syntax conventions, using " for a string that has to be used as an array of char, rather than your implied semantics-based proposal to use ' instead.

About "0" versus u8"0", you are reading right, only the latter is guaranteed to be identical to { 0x30, 0 }, even on EBCDIC systems. By the way, the very fact the former is not can be handled conveniently in your code, if you pay attention to the __STDC_MB_MIGHT_NEQ_WC__ predefined identifier.

AntoineL