I'm trying to understand both trigraphs and digraphs rather than use them.

I've read that post and I understood that:

  • Converting trigraphs to corresponding characters shall always be done by the preprocessor, before the actual compilation starts.
  • Converting digraphs to corresponding characters shall be performed by the compiler.

Is this true?

Cătălina Sîrbu

2 Answers

3

Trigraph sequences are indeed replaced with the corresponding characters in the first translation phase, before the preprocessor's lexer analyses the stream of characters to produce preprocessing tokens.

The very next phase handles escaped newlines, i.e. instances of \ immediately followed by a newline, which are removed from the character stream. Note that the \ can itself be produced by the first phase as the replacement for the ??/ trigraph.

The lexer then analyses the character stream to produce preprocessing tokens. [ and <: are alternate spellings of the same token, just like 1e1 and 1E1. Hence <: is not replaced with [; it is a different sequence of characters producing the same token.

Trigraphs cannot be produced by token pasting using the ## preprocessor operator in macro expansions, but digraphs can.

Here is a small sample program to illustrate this process, including the special handling of the ??/ trigraph, which expands to \ and can therefore be used in the middle of a digraph split across 2 lines:

#include <stdio.h>

#define STR(x) #x
#define xSTR(x) STR(x)
#define glue(a,b) a##b

int main() {
    puts(STR(??!));
    puts(STR('??!'));
    puts(STR("??!"));

    puts(STR(<:));
    puts(STR('<:'));
    puts(STR("<:"));

    puts(STR(<\
:));
    puts(STR(<??/
:));
    puts(STR('<\
:'));
    puts(STR("<\
:"));

    puts(STR(glue(<,:)));
    puts(xSTR(glue(<,:)));
    return 0;
}

Output:

chqrlie $ make lexing && ./lexing
clang -O3 -funsigned-char -std=c11 -Weverything -Wwrite-strings  -lm -o lexing lexing.c
lexing.c:8:14: warning: trigraph converted to '|' character [-Wtrigraphs]
    puts(STR(??!));
             ^
lexing.c:9:15: warning: trigraph converted to '|' character [-Wtrigraphs]
    puts(STR('??!'));
              ^
lexing.c:10:15: warning: trigraph converted to '|' character [-Wtrigraphs]
    puts(STR("??!"));
              ^
lexing.c:18:15: warning: trigraph converted to '\' character [-Wtrigraphs]
    puts(STR(<??/
              ^
4 warnings generated.
|
'|'
"|"
<:
'<:'
"<:"
<:
<:
'<:'
"<:"
glue(<,:)
<:
chqrlie
  • What's interesting about the order of phases is that the `\` in a line continuation could have been originally typed as `??/`, further complicating attempts to decompose C with regular expressions. Fortunately, these arcane corners of the C standard are slowly but surely being removed. – rici Jul 25 '20 at 17:55
  • @rici: I updated the answer to address your remark. – chqrlie Jul 29 '20 at 13:29
2

Digraphs are not "converted to the corresponding character." The string literal "<:" contains the two characters < and : (plus a null terminator). Contrast that with the string "??(" if you have a compiler which supports trigraphs.

<: is simply a token with exactly the same syntactic significance as [. But it is never converted to [. If you pass it to the stringify operator #, you will get the string "<:".

rici