
When defining the grammar for a language parser, how do you deal with things like comments (e.g. `/* ... */`) that can occur at any point in the text?

Building up your grammar from tags within tags seems to work great when things are structured, but comments seem to throw everything off.

Do you just have to parse your text in two passes: first to strip out these items, then to pick apart the actual structure of the code?

Thanks

Jagu

2 Answers


Normally, comments are treated by the lexical analyzer outside the scope of the main grammar. In effect, they are (usually) treated as if they were blanks.
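A minimal sketch of this idea in Python, assuming a hand-written regex tokenizer (the names `TOKEN_SPEC` and `tokenize` are illustrative, not from any particular library): comments and whitespace are matched like any other token but never emitted, so the grammar above the lexer never has to mention them.

```python
import re

# Token patterns for a toy language; COMMENT and WS are matched but skipped,
# so the parser's grammar never has to account for them.
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),   # /* ... */ (non-greedy, no nesting)
    ("WS",      r"\s+"),
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=;(){}]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC),
                    re.DOTALL)

def tokenize(text):
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if not m:
            raise SyntaxError(f"Unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup not in ("COMMENT", "WS"):   # treat comments like blanks
            yield (m.lastgroup, m.group())

print(list(tokenize("x = 1 /* set x */ + 2;")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '1'), ('OP', '+'), ('NUMBER', '2'), ('OP', ';')]
```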

Jonathan Leffler

One approach is to use a separate lexer. Another, much more flexible way is to amend all your token-like entries (keywords, lexical elements, etc.) with an implicit whitespace prefix, valid for the current context. This is how most modern Packrat parsers deal with whitespace; see the sketch below.
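A rough sketch of the second approach in Python, using hypothetical parser-combinator helpers (`token`, `seq`, `SKIP` are made up for illustration, not from a real library): every token parser first consumes any run of whitespace or comments, so the grammar rules themselves never mention either.

```python
import re

# Anything a token may be preceded by: whitespace and /* ... */ comments.
SKIP = re.compile(r"(?:\s+|/\*.*?\*/)*", re.DOTALL)

def token(pattern):
    """Return a parser that skips whitespace/comments, then matches `pattern`."""
    regex = re.compile(pattern)
    def parse(text, pos):
        pos = SKIP.match(text, pos).end()      # implicit whitespace/comment prefix
        m = regex.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parse

def seq(*parsers):
    """Run parsers in order; succeed only if all of them succeed."""
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

# Grammar: assignment = ident "=" number ";"
ident, eq, number, semi = token(r"[A-Za-z_]\w*"), token(r"="), token(r"\d+"), token(r";")
assignment = seq(ident, eq, number, semi)

print(assignment("x /* answer */ = 42 ;", 0))
# (['x', '=', '42', ';'], 21)
```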

SK-logic
  • I'm having trouble parsing your recommendation :-). In an EBNF token rule such as `hash = "#" ;`, how would it be amended to include an "implicit whitespace prefix"? Implicit normally means inferred from something whereas explicit means that the rule is changed to specify the prefix, e.g., `hash = [WS] "#" ;`. A tiny grammar example would help. – Dave Oct 30 '16 at 22:55
  • @Dave, The rule I'm using (not necessarily the best one, feel free to experiment) is to add a whitespace to any token node, e.g., for your `hash = "#"` this conversion would yield `hash = [whitespace]* "#"`, and the same for all the token nodes you may implicitly lift from your PEG expressions (e.g., if you have an expression `atom = { "(" [expr] ")" } / ...`, you will have two implicit token nodes `"("` and `")"`, with whitespace added to both). You can see the details of my implementation here: https://github.com/combinatorylogic/mbase/blob/master/src/l/lib/parsing/compiler.al – SK-logic Oct 31 '16 at 00:42