antlr v3 context-aware conditional comment inclusion

Question

I'm modifying a DSL grammar for a product that is in public use. Currently all /*...*/ comments are silently ignored, but I need to modify it so that comments that are placed before certain key elements are parsed into the AST. I need to maintain backwards compatibility whereby users can still add comments arbitrarily throughout the DSL and only those key comments are included.

The parser grammar currently looks a bit like this:

grammar StateGraph;
graph: 'graph' ID '{' graph_body '}';
graph_body: state+;
state: 'state' ID '{' state_body '}';
state_body: transition* ...etc...; 
transition: 'transition' (transition_condition) ID ';';
COMMENT: '/*' ( options {greedy=false;} : . )* '*/' {skip();};

Comments placed before the 'graph' and 'state' elements contain meaningful description and annotations and need to be included within the parsed AST. So I've modified those two rules and am no longer skipping COMMENT:

graph: comment* 'graph' ID '{' graph_body '}';
state: comment* 'state' ID '{' state_body '}';
COMMENT: '/*' ( options {greedy=false;} : . )* '*/';

If I naively use the above, the other comments cause mismatched token errors when subsequently executing the tree parser. How do I ignore all instances of COMMENT that are not placed in front of 'graph' or 'state'?

An example DSL would be:

/* Some description
 * @some.meta.info
 */
graph myGraph {
  /* Some description of the state.
   * @some.meta.info about the state
   */
  state first {
    transition if (true) second; /* this comment ignored */
  }

  state second {
  }

  /* this comment ignored */
}

One issue I notice is that the parser needs to know in advance whether to expect a COMMENT in relation to every other possible element. If it encounters an unexpected COMMENT, then either the COMMENT itself is mismatched, or I get a "no viable alternative" error on the next element (e.g. a comment followed by a transition). The solution seems to be to push comments into the HIDDEN channel and then conditionally extract them, but that's proving difficult. — Toaomalkster, Apr 17 '12 at 00:31

score 1 · Accepted Answer · answered Apr 17 '12 at 04:07

This is the solution I've actually got working. I'd love feedback.

The basic idea is to send comments to the HIDDEN channel, manually extract them in the places where I want them, and to use rewrite rules to re-insert the comments where needed. The extraction step is inspired by the information here: http://www.antlr.org/wiki/pages/viewpage.action?pageId=557063.

The grammar is now:

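grammar StateGraph;

// rewrite rules (->) require AST output
options { output=AST; }

// imaginary token used as the root of the comments subtree
tokens { COMMENTS; }

@header {
import java.util.ArrayList;
import java.util.List;
}

@members {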
// matches comments immediately preceding specified token on any channel -> ^(COMMENTS COMMENT*)
CommonTree treeOfCommentsBefore(Token token) {
    List<Token> comments = new ArrayList<Token>();
    for (int i=token.getTokenIndex()-1; i >= 0; i--) {
       Token t = input.get(i);
       if (t.getType() == COMMENT) {
          comments.add(t);
       }
       else if (t.getType() != WS) {
          break;
       }
    }
    java.util.Collections.reverse(comments);

    CommonTree commentsTree = new CommonTree(new CommonToken(COMMENTS, "COMMENTS"));
    for (Token t: comments) {
       commentsTree.addChild(new CommonTree(t));
    }
    return commentsTree;
}
}

graph
    : 'graph' ID '{' graph_body '}'
      -> ^(ID {treeOfCommentsBefore($start)} graph_body);
graph_body: state+;
state
    : 'state' ID '{' state_body '}'
      -> ^(ID {treeOfCommentsBefore($start)} state_body);
state_body: transition* ...etc...; 
transition: 'transition' (transition_condition) ID ';';
COMMENT: '/*' .* '*/' {$channel=HIDDEN;};
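
For anyone who wants to check which comments the backward scan attaches without generating a parser, here is a minimal standalone sketch of the same loop over a plain list. `Tok`, the type constants and `demoTokens()` are made up for the illustration and are not ANTLR classes; the real helper reads from the parser's token stream instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CommentScanSketch {
    // Hypothetical token types standing in for the generated lexer constants.
    static final int COMMENT = 1, WS = 2, OTHER = 3;

    // Hypothetical minimal token: just a type and its text.
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    // Walks backwards from the token at `index`, collecting comments and
    // stepping over whitespace, and stops at the first token of any other kind.
    static List<String> commentsBefore(List<Tok> tokens, int index) {
        List<String> found = new ArrayList<String>();
        for (int i = index - 1; i >= 0; i--) {
            Tok t = tokens.get(i);
            if (t.type == COMMENT) {
                found.add(t.text);
            } else if (t.type != WS) {
                break;
            }
        }
        Collections.reverse(found); // restore source order
        return found;
    }

    // A hand-tokenized, cut-down version of the example DSL from the question.
    static List<Tok> demoTokens() {
        List<Tok> ts = new ArrayList<Tok>();
        ts.add(new Tok(COMMENT, "/* graph doc */")); // index 0
        ts.add(new Tok(WS, "\n"));
        ts.add(new Tok(OTHER, "graph"));             // index 2
        ts.add(new Tok(WS, " "));
        ts.add(new Tok(OTHER, "myGraph"));
        ts.add(new Tok(OTHER, "{"));
        ts.add(new Tok(WS, "\n"));
        ts.add(new Tok(COMMENT, "/* state doc */")); // index 7
        ts.add(new Tok(WS, "\n"));
        ts.add(new Tok(OTHER, "state"));             // index 9
        ts.add(new Tok(OTHER, "first"));
        ts.add(new Tok(OTHER, "{"));
        ts.add(new Tok(OTHER, "transition"));        // index 12
        ts.add(new Tok(OTHER, ";"));
        ts.add(new Tok(WS, " "));
        ts.add(new Tok(COMMENT, "/* ignored */"));   // index 15
        ts.add(new Tok(WS, "\n"));
        ts.add(new Tok(OTHER, "}"));                 // index 17
        return ts;
    }

    public static void main(String[] args) {
        List<Tok> ts = demoTokens();
        System.out.println(commentsBefore(ts, 2));  // comments before 'graph'
        System.out.println(commentsBefore(ts, 9));  // comments before 'state'
        System.out.println(commentsBefore(ts, 12)); // comments before 'transition'
    }
}
```

In this sketch a scan from the closing '}' (index 17) would also return the "ignored" comment. It stays out of the AST only because no rule calls the helper at that position.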

This is a solution I used myself in one of my earlier grammars, but I wasn't sure whether it is a good solution, because I really don't like it. It seems way too complicated and cumbersome, but it seems to be the only viable option right now, too bad. — stryba, Apr 17 '12 at 08:51
Agreed, it seems that we need a more direct way of selecting whether the parser matches from the default channel or the hidden one. — Toaomalkster, Apr 17 '12 at 21:11

score 0 · Answer 2 · answered Apr 16 '12 at 13:56


Does this work for you?

grammar StateGraph;
graph: 'graph' ID '{' graph_body '}';
graph_body: state+;
state: .COMMENT 'state' ID '{' state_body '}';
state_body: .COMMENT transition* ...etc...; 
transition: 'transition' (transition_condition) ID ';';
COMMENT: '/*' ( options {greedy=false;} : . )* '*/' {skip();}

answered Apr 16 '12 at 13:56

stryba

No, as soon as you `skip()` a token, you cannot use it in a parser rule. Why the `.` in front of it, btw? – Bart Kiers Apr 16 '12 at 15:17
Because that dot supposedly tells ANTLR to look on the hidden channel for the token following it. I'm not sure whether your skip() puts the token on the hidden channel though. Did you actually test it? – stryba Apr 16 '12 at 19:59
Err, no, a `.` in a parser rule matches any token on the channel the parser currently operates on (this is the default channel, if not changed), it does not get a token from the hidden channel. Also, `skip()` discards the token completely from the lexer, it does not put the token on another channel. So I could ask the same to you: did *you* test it? :) – Bart Kiers Apr 16 '12 at 20:26
I am curious though: where did you get the idea that the `.` causes ANTLR to read from another channel? – Bart Kiers Apr 16 '12 at 20:30
@BartKiers: I somehow thought I read it somewhere. I obviously know what a `.` in ANTLR means. Here I meant `.COMMENT` as one entity. You are right, I didn't test it, as I usually don't have the time for it. Like I said, off the top of my head it said to use `.COMMENT`. – stryba Apr 16 '12 at 22:57
Oh and btw, I just saw that the first comment was from you and not the OP, so my apologies; of course you don't need to test it either, the OP should. Anyway, I just did a quick google and found this: `// move comments in front of assignments to end assign: .COMMENT ID '=' expr ';' -> ID '=' expr ';' COMMENT where the dot in front means "hidden" like in UNIX filenames.` It comes from Terence's blog, but apparently it's not yet implemented. I knew I read it somewhere :) – stryba Apr 16 '12 at 23:00
Ah, I see. Yeah, it would be neat if that were possible. I don't think it will be implemented though, seeing the blog entry is from way back in 2007... But who knows, perhaps v4 of ANTLR will have it. And your remark *"obviously know what a . in ANTLR means"*, no, that is not obvious. How should I know that? Besides, many people think they know ANTLR but still don't know what the difference between a `.` in a lexer rule and a `.` in a parser rule is. – Bart Kiers Apr 17 '12 at 07:27

Bart Kiers · Answer 3 · 2012-04-16T15:37:34.430


How do I ignore all instances of COMMENT that are not placed in front of 'graph' or 'state'?

You can do that with a syntactic predicate in the lexer: after the closing "*/" of a comment, check whether 'graph' or 'state' lies ahead, with optional spaces in between. If so, do nothing and the token is kept. If not, the predicate fails, the rule falls through to the second alternative, and the comment token is simply skip()ped.

In ANTLR syntax that would look like:

COMMENT
 : '/*' .* '*/' ( (SPACE* (GRAPH | STATE))=> /* do nothing, so keep this token */ 
                | {skip();}                  /* or else, skip it               */ 
                )
 ;

GRAPH  : 'graph';
STATE  : 'state';
SPACES : SPACE+ {skip();};

fragment SPACE : ' ' | '\t' | '\r' | '\n';

Note that .* and .+ are ungreedy by default: no need to set options{greedy=false;}.

Also, be careful not to use SPACES inside your COMMENT rule: SPACES calls skip() when it is matched, so use the SPACE fragment there instead!
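
To make the decision the predicate encodes easier to follow, here is a rough plain-Java sketch of the same check on a raw string. It is only an illustration of the lookahead logic, not what ANTLR generates; `keepComment` and `KEYWORDS` are hypothetical names.

```java
class CommentLookaheadSketch {
    // Hypothetical keyword list; a real grammar would need every token
    // that may legally follow a meaningful comment.
    static final String[] KEYWORDS = {"graph", "state"};

    // Same four characters as the SPACE fragment.
    static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    // Returns true when the comment starting at `start` in `src` is followed,
    // after optional spaces, by one of the keywords. Like the predicate, this
    // is a prefix check only: it does not verify that the keyword ends there.
    static boolean keepComment(String src, int start) {
        int end = src.indexOf("*/", start + 2);
        if (end < 0) {
            return false; // unterminated comment
        }
        int i = end + 2;
        while (i < src.length() && isSpace(src.charAt(i))) {
            i++;
        }
        for (String kw : KEYWORDS) {
            if (src.startsWith(kw, i)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String doc = "/* doc */\n  state first { }";
        String trailing = "transition x; /* note */ }";
        System.out.println(keepComment(doc, 0));                          // true
        System.out.println(keepComment(trailing, trailing.indexOf("/*"))); // false
    }
}
```

In this sketch a comment that is directly followed by another comment is dropped, because the text after it does not start with a keyword.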

edited Apr 16 '12 at 15:37

answered Apr 16 '12 at 15:23

Bart Kiers

Thanks @BartKiers, that is the most concrete solution I have so far. Unfortunately the solution does not scale well, because now COMMENT needs to know about all its _possible_ contexts. For example, 'state' is actually preceded by 'initial' or 'final' or both, so I have to include that logic within the predicate as well. This is just a cut-down version of the grammar, so that predicate gets very complex very quickly. – Toaomalkster Apr 17 '12 at 00:28
http://www.antlr.org/wiki/display/ANTLR3/Rule+and+subrule+options says that greedy=true by default – Toaomalkster Apr 17 '12 at 05:05
...but rethinking that, `.*` would _have_ to be non-greedy or it would never work. – Toaomalkster Apr 17 '12 at 05:08
@Toaomalkster, I see, yes, then this solution might not be the best: your lexer might have to do a lot of looking ahead... – Bart Kiers Apr 17 '12 at 07:28
@Toaomalkster, `*` and `+` are greedy by default except when preceded by `.`. See Terence Parr's *Definitive ANTLR Reference*, chapter 4: *Extended BNF Subrules*, page 86. – Bart Kiers Apr 17 '12 at 07:29
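
The greedy versus non-greedy point can be seen with an analogy in `java.util.regex`. This is only an analogy (ANTLR's lexer is not a regex engine), and `firstMatch` is a hypothetical helper: a greedy `.*` runs to the last "*/" in the input, while the reluctant `.*?` stops at the first one, which is the behaviour a comment rule needs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyCommentSketch {
    // Returns the first match of `regex` in `src`, or null if there is none.
    // DOTALL lets '.' match line breaks, so multi-line comments are covered.
    static String firstMatch(String regex, String src) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(src);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String src = "/* a */ state s {} /* b */";
        // Greedy: swallows everything up to the last "*/".
        System.out.println(firstMatch("/\\*.*\\*/", src));  // /* a */ state s {} /* b */
        // Reluctant: stops at the first "*/".
        System.out.println(firstMatch("/\\*.*?\\*/", src)); // /* a */
    }
}
```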
