4

On a C compiler which uses ASCII as its character set, the value of the character literal '??<' would be equivalent to that of '{', i.e. 0x7B. What would be the value of that literal on a compiler whose character set doesn't have a { character?

Outside a string literal, a compiler could infer that ??< is supposed to have the same meaning as an open-brace character is defined to have, even if the compiler's character set doesn't have an open-brace character. Indeed, the whole purpose of trigraphs is to allow sequences of representable characters to be used in place of characters that aren't representable. The spec requires that trigraphs be processed even within string literals, however, which has me puzzled. If a compiler's character set includes a { character, the compiler can allow '{' to be represented as '??<', but if the character set includes { I see no reason a programmer wouldn't simply use it directly. If the character set doesn't include {, however, which would seem the only reason for using trigraphs in the first place, what representable character would a compiler be expected to replace ??< with?
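
For concreteness, a minimal sketch of the ASCII case (assuming a compiler with trigraph processing enabled, e.g. `gcc -std=c89 -trigraphs`; both lines print 123 on an ASCII implementation):

```c
#include <stdio.h>

int main(void)
{
    printf("%d\n", '??<');  /* trigraph replaced in translation phase 1 */
    printf("%d\n", '{');    /* prints 123 (0x7B) on an ASCII system */
    return 0;
}
```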

supercat
  • 77,689
  • 9
  • 166
  • 211
  • 2
    The character set might contain `{`, but it might not be easy or even possible to type `{` using the keyboard used to write the program. – Mankarse Aug 26 '14 at 04:20
  • The same could be said for many thousands of characters one might want to include within character or string literals; I'm not sure anything is special about `{` or any of the other trigraph characters in that regard. If the character set is known, one could simply use `0x7B` or `'\x7B'` even if one couldn't type `{`. And for string (vs character) literals, a "stringize" macro could probably yield better-looking results: `#define LBR __stringize(??<)` would define `LBR` as `"{"` [whatever the `{` character is] (see the sketch after these comments). – supercat Aug 26 '14 at 04:29
  • @supercat That is one reasonable solution, but the C89 standardization committee happened to pick a different solution. – Potatoswatter Aug 26 '14 at 04:38
  • `'\x7B'` and `'??<'` are different. The former has the value 123, the latter the character code of `{` (which is different for non-ASCII systems). You couldn't write portable code doing the latter without a keyboard having a `{` key. – mafso Aug 26 '14 at 12:02
  • Maybe take a look at [the C89 rationale about trigraphs](http://www.lysator.liu.se/c/rat/b.html#2-2-1-1). – mafso Aug 26 '14 at 12:07
  • @mafso: With regard to 0x7B, I said "if the character set is known". Otherwise, do any trigraphs other than `??/` provide any functionality which couldn't have been accomplished by specifying a .h file with macros for characters [e.g. `#define __LBR {`/`#define __clbr 0x7B`/`#define __SLBR __STRINGIZE(<:)`]? – supercat Aug 26 '14 at 12:15
  • The character set is never known for strictly conforming code. Maybe you want to target ASCII and EBCDIC systems and write code with a keyboard lacking certain keys. About your macro question: Yes, seems possible, from the link above: _Some users may wish to define preprocessing macros for some or all of the trigraph sequences._ You can write that header yourself if you want to, the other way round wouldn't be possible, so the way chosen is at least the more flexible way. – mafso Aug 26 '14 at 12:29
  • @mafso: The quoted rationale talks about character sets which *don't have* certain glyphs. I see nothing about ease of typing. If one is using a system which has mapped 0x7B to `é` and 0x7D to `è`, and which doesn't *have* glyphs for `{` and `}`, I can see that `int main(void) <: doSomething(); :>` could be better than `int main(void) é doSomething(); è`. I would guess that `"l'??<l??>ve"` would probably render as `"l'élève"` rather than `"l'{l}ve"` on such a machine, but I don't know of anything in the spec that would say that. – supercat Aug 26 '14 at 12:53
  • You're right. While trigraphs may be useful for keyboards lacking corresponding keys, that doesn't seem to have been a reason to include them in the language. And I don't think the C standard says anything about representation, see for example 2.2.1 of the rationale _[...] the common Japanese practice of using the glyph ¥ for the C character \ is perfectly legitimate._ – mafso Aug 26 '14 at 13:27
  • Back to your actual question: Probably it can be replaced (in a string constant) with anything that would be recognized as the corresponding character. For example, think about code generation: the output of `printf("int main(void) ??< ??>??/n");` should be compilable on that platform. – mafso Aug 26 '14 at 13:28
  • @mafso: That would make sense; I wonder if there's anything that officially defines things in such terms (e.g. saying that there must exist a single-byte character that a compiler will recognize syntactically in the fashion defined for `{`, and `??<` must expand to that character). If you can find anything that clearly specifies that and write an answer, I'll accept it. – supercat Aug 26 '14 at 15:21
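
As floated in the comments above, the "stringize" approach can be sketched in standard C (the `__stringize` name used in the comment is hypothetical; the standard mechanism is the `#` operator, used here via the usual two-level expansion):

```c
#include <stdio.h>

/* Two-level stringizing so the argument is fully expanded before
   # is applied; the trigraph itself is already replaced by { in
   translation phase 1, before the preprocessor runs. */
#define STRINGIZE_(x) #x
#define STRINGIZE(x)  STRINGIZE_(x)

#define LBR STRINGIZE(??<)  /* expands to "{" -- whatever { maps to */

int main(void)
{
    puts(LBR);  /* prints the implementation's open-brace character */
    return 0;
}
```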

2 Answers

5

What would be the value of that literal on a compiler whose character set doesn't have a { character?

There is no such (conforming) compiler. { is part of the basic source character set (5.2.1/3 in C99, [lex.charset]/1 in C++11). The basic execution character set (what the program uses at run-time) shall contain at least all the members of the basic source character set (the same 5.2.1/3 in C99, [lex.charset]/3 in C++11).

As @Mankarse notes, trigraphs were invented not to support compilers that lacked certain characters (again, there are no such compilers), but to support humans typing at keyboards that lacked keys necessary to enter those characters.
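
A sketch of what that guarantee implies (assuming trigraph processing is enabled): since `{` must be a member of both the basic source and execution character sets, the two spellings below are required to denote the same character, so this C89-style compile-time check must pass:

```c
/* Array size would be negative if the two constants ever differed,
   so a conforming compiler (with trigraphs processed) must accept this. */
static char trigraph_is_brace['??<' == '{' ? 1 : -1];

int main(void) { return 0; }
```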

Igor Tandetnik
  • 50,461
  • 4
  • 56
  • 85
  • 1
    [lex.charset] is a C++ cross-reference. This question is not about C++. The corresponding C11 paragraph is §5.2.1/3. – Potatoswatter Aug 26 '14 at 04:41
  • 1
    Also, the requirement in both C and C++ is a "shall," which is an absolute requirement, stronger than "should." – Potatoswatter Aug 26 '14 at 04:43
  • @Potatoswatter: Ah, indeed, I wasn't paying attention, and just assumed C++. I'll correct the answer shortly. – Igor Tandetnik Aug 26 '14 at 04:45
  • Where did you read that rationale for trigraphs? The rationale quoted in a comment on the question suggests that the purpose was to allow for systems which had a limited number of glyphs they could display or print, and which allocated character codes to glyphs like `é` and `è` rather than `{`and `}`. – supercat Aug 26 '14 at 12:42
  • @supercat [Digraphs and trigraphs](http://en.wikipedia.org/wiki/Digraphs_and_trigraphs): "The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support any of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any version of the ISO 646 character set." – Igor Tandetnik Aug 26 '14 at 12:56
  • @IgorTandetnik: "And possibly keyboard". That would imply that the issue isn't just with *typing* the characters in question--the glyphs don't exist in the character encoding. A machine which renders 0x7B and 0x7D to `é` and `è` would probably have a way of typing `é` and `è`, but may have no way of typing or displaying `{` and `}`, or it might map `{` and `}` to 0xFB and 0xFD [I've seen machines where 0x20-0x7E render using one selectable character set, and 0xA0-0xFE render as another]. I would guess that even on a machine where 0xFB and 0xFD are needed for `{` and `}` – supercat Aug 26 '14 at 13:07
  • @IgorTandetnik: because 0x7B and 0x7D render as `é` and `è`, a compiler would probably render `??<` and `??>` as `é` and `è`, but I don't know of anything in the spec that says that. – supercat Aug 26 '14 at 13:08
  • @supercat The compiler doesn't "render" anything. I'm not sure what you are talking about. If you are talking about encoding of a physical file on disk, then **5.1.1.2/1** "Physical source file multibyte characters are mapped, in an **implementation-defined manner**, to the source character set" (emphasis mine). I suppose it's possible that, keyboard aside, there once existed text editors unable to represent `{` (saving files in ISO 646 encoding), and trigraphs were invented to support writing code in those editors. It is text editors that "render" characters in a text file, not compilers. – Igor Tandetnik Aug 26 '14 at 13:56
  • @IgorTandetnik: Historically it was *display hardware* that was responsible for character rendering. If the issue had simply been the keyboard, a programmer could simply write a program which would translate alternative characters or character sequences into the C character set and feed a program through that before feeding it to the C compiler. Also, the quote above refers to multi-byte characters, which is a separate issue. – supercat Aug 26 '14 at 14:13
1

When it comes to considerations about the environment, especially files, the C standard is intentionally rather vague. The following guarantees are made about trigraphs and the encoding of their corresponding characters:

C11 (n1570) 5.1.1.2 p1 (“Translation phases”) [emph. mine]

  1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

Thus, the trigraph sequence must be mapped to a single byte. That single-byte character must be a member of the basic character set, distinct from every other character in the basic character set. How the compiler handles it internally during translation isn’t really observable behaviour, so it’s irrelevant.
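
One observable consequence of that phase-1 mapping (a sketch, again assuming trigraphs are processed, e.g. `gcc -std=c89 -trigraphs`): the sequence counts as a single byte inside literals:

```c
#include <assert.h>

int main(void)
{
    assert(sizeof "??<" == 2);  /* "{" plus the terminating null byte */
    assert('??<' == '{');       /* one single-character constant      */
    return 0;
}
```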

If written to a text stream, it may be converted (as I read it, possibly even back to a trigraph sequence if the underlying encoding has no encoding for a certain character). It can be read back again, and must compare equal if it is considered a printing character. Ibid. 7.21.2 p2:

[…] Data read in from a text stream will necessarily compare equal to the data that were earlier written out to that stream only if: the data consist only of printing characters and the control characters horizontal tab and new-line; no new-line character is immediately preceded by space characters; and the last character is a new-line character. […]

Ibid. 7.4 p3:

The term printing character refers to a member of a locale-specific set of characters, each of which occupies one printing position on a display device; the term control character refers to a member of a locale-specific set of characters that are not printing characters.*) All letters and digits are printing characters.

*) In an implementation that uses the seven-bit US ASCII character set, the printing characters are those whose values lie from 0x20 (space) through 0x7E (tilde); the control characters are those whose values lie from 0 (NUL) through 0x1F (US), and the character 0x7F (DEL).

And for binary streams, ibid. 7.21.2 p3:

A binary stream is an ordered sequence of characters that can transparently record internal data. Data read in from a binary stream shall compare equal to the data that were earlier written out to that stream, under the same implementation. Such a stream may, however, have an implementation-defined number of null characters appended to the end of the stream.
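
A sketch of what those stream guarantees are meant to buy (the file name is illustrative): a line of C source written to a text stream, meeting the conditions quoted above, should read back comparing equal, since the rationale's intent is that the characters needed for writing C programs survive text-stream I/O:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *src = "int main(void) { return 0; }\n";
    char line[64];
    FILE *f = fopen("roundtrip.c", "w");  /* illustrative file name */

    if (!f) return 1;
    fputs(src, f);   /* printing characters, ends in new-line: preserved */
    fclose(f);

    f = fopen("roundtrip.c", "r");
    if (!f) return 1;
    if (fgets(line, sizeof line, f) && strcmp(line, src) == 0)
        puts("round-trip preserved the C source characters");
    fclose(f);
    return 0;
}
```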

In the comments above, the question arose whether

printf("int main(void) ??< ??>\n");     // (1) 
printf("int main(void) ?\?< ?\?>\n");   // (2)

always works for code generation and whether the output of that statement is guaranteed to be compilable. I couldn’t find a normative reference requiring isprint('??<') etc. (for (1)) or even isprint('<') etc. (for (2)) to return non-zero, but the C89 rationale about streams says:

The set of characters required to be preserved in text stream I/O are those needed for writing C programs; the intent is the Standard should permit a C translator to be written in a maximally portable fashion. Control characters such as backspace are not required for this purpose, so their handling in text streams is not mandated.

When '??<' etc. is written to a binary stream, it must map to a single byte, be written as such, be distinguishable from every other basic character, and compare equal to '??<' when read back.
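
To make the difference between forms (1) and (2) concrete (a sketch; the quoted outputs assume an ASCII system with trigraph processing enabled):

```c
#include <stdio.h>

int main(void)
{
    /* (1): the trigraphs are replaced during translation of *this*
       program, so it emits the implementation's actual brace
       characters, e.g. "int main(void) { }" on ASCII.            */
    printf("int main(void) ??< ??>\n");

    /* (2): ?\? contains no trigraph, so the characters ??< and ??>
       reach the output literally; the replacement is deferred to
       whichever compiler later translates the generated text.    */
    printf("int main(void) ?\?< ?\?>\n");
    return 0;
}
```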


Related: [the C89 rationale about trigraphs](http://www.lysator.liu.se/c/rat/b.html#2-2-1-1).

mafso
  • 5,433
  • 2
  • 19
  • 40
  • 1
    Thanks. So a system could arbitrarily select character codes for `'??<'` and `'??>'`, but they would need to be distinct printing characters. I'm curious what any C compilers for the Commodore 64 might have done; I think there were some, but that machine didn't have glyphs shaped like `~`, `\ `, `{`, or `}`; and the only things resembling `|` and `_` were box-drawing characters (a centered vertical line and a bottom-of-box horizontal line). ASCII 0x5E was an up-arrow (close enough to `^` to simply call it that), but 0x5F was a back-arrow. If I were designing a compiler for that system... – supercat Aug 29 '14 at 15:10
  • 1
    ...I'd probably interpret the box-bottom character as synonymous with back-arrow in identifiers but not literals, and probably accept some box-drawing characters as synonymous with braces or pipe (they were easy to type on the keyboard). I'd probably accept `£` as synonymous with backslash (it's code 0x5C) since I can't think of any other graphic character that would really be better. Not sure about tilde; maybe a top-of-box character (since its meaning is similar to an overbar in digital signal descriptions). – supercat Aug 29 '14 at 15:15
  • I'm not sure about them being printing characters. The rationale says this is intended (or at least, that they can be written to and read back from a _text stream_), so in a way, they need to be printable. On the other hand, only the digits and letters are explicitly required to be printing characters. – mafso Aug 29 '14 at 15:23
  • Your C64 example is a good one for why trigraphs may be handy: Suppose you code on that machine. You _could_ use the characters you mentioned directly in the source code. If you now wanted to port your code to a different machine (say, using UTF-8), all the non-ISO646 characters would be converted wrongly by a C64-to-UTF8 converter, but trigraphs would be converted correctly (to trigraphs). – mafso Aug 29 '14 at 15:26
  • Indeed, I would see that as a basis for supporting trigraphs outside quotes, and I have no objection to those (though `??<` and `??>` are redundant with the IMHO vastly superior `<:` and `:>`). The only expressiveness gained by parsing trigraphs within quotes, however, is the ability to use backslash escapes like `"??/n"`, and I would think there would be better ways of accomplishing that [e.g. specify that if a string literal is preceded by a (possibly trigraph) hashmark and another special character, that character will substitute for backslash until the next quote. Thus... – supercat Aug 29 '14 at 15:57
  • ...`char * foo = #$"foo\bar$n"` would set `foo` equal to a string containing a backslash character and a newline]. If a compiler on an odd-ball character set uses `┤` and `├`, which couldn't reliably translate to ASCII, for its braces, having `printf("int main() ??<doSomething();??>");` use those same characters would make sense, though any problems one would have porting the code if one used `printf("int main() ┤doSomething();├");` would apply equally to the output of the trigraph code, while `printf("int main() <: doSomething();:>");` would have translation problems with neither code nor output. – supercat Aug 29 '14 at 16:07
  • Hmm... I don't get your point why the problems with `printf("int main() ┤doSomething();├");` also apply to `printf("int main() ??<doSomething();??>");`... When I convert them to my machine (using UTF-8), the output of the former doesn't compile, but the output of the latter does. – mafso Aug 29 '14 at 16:11
  • 1
    On your machine, the latter code would output `int main() {doSomething();}`, but on a machine which used `┤` and `├` as braces, it would output `int main() ┤doSomething();├`. Incidentally, I looked briefly at C64 C compilers and it looks as though they have built-in editors which reprogram the character set to include ASCII characters, which now makes me curious about how string literals should get interpreted. The C64 has two selectable pre-loaded character sets; one defines 0x53 and 0x73 as `S` and `♥`; the other as `s` and `S` [*in that order*]. In ASCII, they're `S` and `s` [other order]. – supercat Aug 29 '14 at 16:31