
I'm currently working on a toy language that works like this: one can embed blocks written in this language into a C++ source, and before compilation, these blocks are translated into C++ in an extra preprocessing step, producing a valid C++ source.

I want to make sure that these blocks can always be identified in the source unambiguously, and that any source containing such a block cannot be valid C++. Moreover, I want to achieve this while placing as few constraints on the embedded language as possible (the language itself is still somewhat fluid).

The obvious way would be to introduce a pair of special multi-character parentheses, made of characters that cannot appear together in valid C++ code (or in the embedded language). However, I'm not sure how to ensure that a particular character sequence is good for this purpose (not after GotW #78, anyway (: ).

So what is a good way to escape these blocks?

xcvii
  • One thing to be wary of is [Digraphs and Trigraphs](http://en.wikipedia.org/wiki/Digraphs_and_trigraphs) – user123 May 12 '13 at 11:53
  • A raw string literal can contain any sequence of characters (excluding invalid Unicode sequences), so there are no isolated sequences of characters that can never appear in C++ source. You will need to write a simple C++ lexer (or use a pre-written one like [Boost.Wave](http://www.boost.org/doc/libs/1_53_0/libs/wave/)), and then just use any sequence of characters that is not part of a literal and which is not an identifier or operator. – Mankarse May 12 '13 at 12:01

1 Answer


If your compiler can be made to accept the C++11 standard, you could use raw string literals, e.g.:

  std::cout << R"*(<!DOCTYPE html>
       <html>
       <head>
       <title>Title with a backslash \ here 
     and double " quote</title>)*";

Hence, with raw string literals, there is no forbidden sequence of characters: any sequence can appear inside one, because you choose the delimiter that ends the literal.
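To make this concrete, here is a minimal sketch showing that a candidate marker pair such as #{ and }# can sit inside perfectly valid C++11 once it is wrapped in a raw string literal. The function name `sample_with_markers` and the delimiter `x` are made up for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical example: the body of this raw string contains the would-be
// block markers "#{" and "}#". The delimiter "x" is arbitrary; only the
// sequence )x" can terminate the literal, so the markers are plain text here.
inline std::string sample_with_markers() {
    return R"x(int a; #{ not a real block }# int b;)x";
}
```

A translator that only searches for the marker text would wrongly treat the contents of this literal as an embedded block.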


You could also use #{ and }#, as I do in MELT macro-strings. MELT is a Lisp-like domain-specific language to extend GCC, and you can embed code in it with e.g.

(code_chunk hellocount_chk
            #{ /* $HELLOCOUNT_CHK chunk */ 
                 static int $HELLOCOUNT_CHK#_counter; 
                 $HELLOCOUNT_CHK#_counter++;
               $HELLOCOUNT_CHK#_lab:
                 printf ("Hello World, counted %d\n", 
                         $HELLOCOUNT_CHK#_counter);
                 if (random() % 4 == 0) goto $HELLOCOUNT_CHK#_lab;
            }#)

The #{ and }# sequences enclose macro-strings (they are unlikely to appear in C or C++ code, except in string literals and comments). Inside a macro-string, $ starts a symbol, which extends up to the next non-letter character or #.

Using #{ and }# is not fool-proof (e.g. because of raw string literals), but it is good enough in practice: a cooperative user can easily avoid writing these sequences in ordinary C++ code.
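If you want to be stricter than relying on cooperation, the translator can skip comments and literals before looking for the markers. The following is only a rough sketch of that idea, not MELT's actual implementation: `find_blocks` is a hypothetical helper, and it is far from a full C++ lexer (it ignores trigraphs, line splices, digit separators such as 1'000, and identifiers that happen to end in R right before a string).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the [begin, end) offsets of every #{ ... }#
// block that lies outside comments, ordinary literals, and raw strings.
std::vector<std::pair<std::size_t, std::size_t>>
find_blocks(const std::string& src) {
    const std::size_t npos = std::string::npos;
    const std::size_t n = src.size();
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    std::size_t i = 0;
    while (i < n) {
        if (src.compare(i, 2, "//") == 0) {             // line comment
            std::size_t e = src.find('\n', i);
            i = (e == npos) ? n : e + 1;
        } else if (src.compare(i, 2, "/*") == 0) {      // block comment
            std::size_t e = src.find("*/", i + 2);
            i = (e == npos) ? n : e + 2;
        } else if (src.compare(i, 2, "R\"") == 0) {     // R"delim( ... )delim"
            std::size_t open = src.find('(', i + 2);
            if (open == npos) break;
            // The closing sequence is ')' + delimiter + '"'.
            std::string close = ")" + src.substr(i + 2, open - (i + 2)) + "\"";
            std::size_t e = src.find(close, open + 1);
            i = (e == npos) ? n : e + close.size();
        } else if (src[i] == '"' || src[i] == '\'') {   // ordinary literal
            const char quote = src[i++];
            while (i < n && src[i] != quote) {
                i += (src[i] == '\\') ? 2 : 1;          // skip escaped chars
            }
            ++i;                                        // closing quote
        } else if (src.compare(i, 2, "#{") == 0) {      // embedded block
            std::size_t e = src.find("}#", i + 2);
            if (e == npos) break;                       // unterminated block
            blocks.emplace_back(i, e + 2);
            i = e + 2;
        } else {
            ++i;
        }
    }
    return blocks;
}
```

Each returned pair can then be replaced by the generated C++ text. Note that the sketch takes the first }# after an opening #{ as the end of the block, so the embedded language itself must not contain }#.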

Basile Starynkevitch
  • That's not how I want it to work, though. I'm writing a translator that takes the foreign blocks, produces C++ from them and inserts them back into the code, before even the preprocessor sees any of the source. – xcvii May 12 '13 at 12:00
  • MELT is doing the same: it is translating MELT language into C++ (before and independently of the C++ preprocessor) – Basile Starynkevitch May 12 '13 at 12:03
  • Fair enough. I wrote the comment before your edit where you added the part about MELT. Edit: or, more likely, I just missed that part somehow, sorry! – xcvii May 12 '13 at 12:14