How to parse only some comments using ANTLRv4

Question

I am developing an application that analyzes Java source code using ANTLRv4. I want to match every single-line comment whose first token is TODO (e.g. // TODO <some-comment>) together with the statement that directly follows it.

Sample code:

class Simple {
    public static void main(String[] args) {
        // TODO develop cycle
        for (int i = 0; i < 5; i++) {
            // unmatched comment
            System.out.println("hello");
        }
        // TODO atomic
        int a;

        // TODO revision required
        {
            int b = a+4;
            System.out.println(b);
        }
    }
}

The desired result is a map like this:

"develop cycle" -> for(...){...}
"atomic" -> int a
"revision required" -> {...}

Following the official book (1) and similar topics on Stack Overflow ((2), (3), (4), (5), (6)), I tried several approaches.

At first I hoped to use a special COMMENTS channel as described in (1) and (2), but this warning occurred: rule 'LINE_COMMENT' contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output.

I think it would be much nicer to parse the source code so that all single-line comments are ignored EXCEPT those beginning with TODO. I hope it is possible to add the TODO comments directly into the AST so that I can use listeners/walkers. Then I would only need to register a listener/walker for the TODO comment and extract the following statement, adding both to the desired map.

I've been modifying the official Java8 grammar for two days without any success. Either the compiler complains or the AST is mangled.

This is the change I made:

// ...
COMMENT
    :   '/*' .*? '*/' -> skip
    ;

TODO_COMMENT
    :   '// TODO' ~[\r\n]*
    ;

LINE_COMMENT
    :   '//' ~[\r\n]* -> skip
    ;

Can anyone help me please? Grammars are not my cup of tea. Thanks in advance

EDIT1:

The grammar modification posted above compiles without error, but the following tree is generated (please note the nodes marked in red, including int):

[image: error AST]

EDIT2:

Given the code sample above, calling parser.compilationUnit(); produces the following errors:

line 3:2 extraneous input '// TODO develop cycle;' expecting {'abstract', 'assert', 'boolean', 'break', 'byte', 'char', 'class', 'continue', 'do', 'double', 'enum', 'final', 'float', 'for', 'if', 'int', 'interface', 'long', 'new', 'private', 'protected', 'public', 'return', 'short', 'static', 'strictfp', 'super', 'switch', 'synchronized', 'this', 'throw', 'try', 'void', 'while', IntegerLiteral, FloatingPointLiteral, BooleanLiteral, CharacterLiteral, StringLiteral, 'null', '(', '{', '}', ';', '<', '!', '~', '++', '--', '+', '-', Identifier, '@'}
line 8:2 extraneous input '// TODO atomic;' expecting {'abstract', 'assert', 'boolean', 'break', 'byte', 'char', 'class', 'continue', 'do', 'double', 'enum', 'final', 'float', 'for', 'if', 'int', 'interface', 'long', 'new', 'private', 'protected', 'public', 'return', 'short', 'static', 'strictfp', 'super', 'switch', 'synchronized', 'this', 'throw', 'try', 'void', 'while', IntegerLiteral, FloatingPointLiteral, BooleanLiteral, CharacterLiteral, StringLiteral, 'null', '(', '{', '}', ';', '<', '!', '~', '++', '--', '+', '-', Identifier, '@'}
line 11:2 extraneous input '// TODO revision required;' expecting {'abstract', 'assert', 'boolean', 'break', 'byte', 'char', 'class', 'continue', 'do', 'double', 'enum', 'final', 'float', 'for', 'if', 'int', 'interface', 'long', 'new', 'private', 'protected', 'public', 'return', 'short', 'static', 'strictfp', 'super', 'switch', 'synchronized', 'this', 'throw', 'try', 'void', 'while', IntegerLiteral, FloatingPointLiteral, BooleanLiteral, CharacterLiteral, StringLiteral, 'null', '(', '{', '}', ';', '<', '!', '~', '++', '--', '+', '-', Identifier, '@'}

So the grammar is obviously incorrect, as it struggles with even this simple example.

What are the exact errors you get for the grammar excerpt above? What kind of mismatches occur? — Uwe Allner, Jun 18 '14 at 11:08
@UweAllner the edit made, please see the scheme provided - Why the `int` is also marked? — petrbel, Jun 18 '14 at 11:18
@petrbel To me this looks quite correct. The int is not part of the `// TODO xxx` but follows it directly in the next block statement subtree. As I understood your problem, this seems to be exactly the desired result. What did you expect instead? — Uwe Allner, Jun 18 '14 at 11:40
@UweAllner e.g. ANTLR didn't generate `Java8BaseVisitor::visitTODO_COMMENT` - how can I match directly following statement without visitor? — petrbel, Jun 18 '14 at 12:20
As you only have a lexer rule for this token, no explicit method is created for it in the visitor. Instead, visitTerminal is called for it. Either you use this method to extract your map key and e.g. set a state "rememberForTodo" in your parser, or you define a parser rule (with a name beginning with a lowercase letter) containing your TODO_COMMENT; then a visitor method is generated for that rule. — Uwe Allner, Jun 18 '14 at 12:42
@UweAllner I updated the question, could you please take a look at **EDIT2**? What exactly do you mean by "define a parser rule"? I tried `todoComment : TODO_COMMENT ;` and even `todoComment : TODO_COMMENT Identifier* ;`, which did generate the corresponding methods, but the listener method `enterTodoComment` was never triggered... Also, printing all terminals doesn't show those TODO comments :-( — petrbel, Jun 18 '14 at 13:09
imho the int is only marked because it directly follows the "extra" `TODO_COMMENT`. I would ignore this error especially since the error messages don't report an error for the `int`. — Onur, Jun 23 '14 at 12:09

Onur · Accepted Answer · 2014-06-23T12:17:01.183

The reason is that no parser rule expects your special comment, i.e. no parser rule will match it.

You can do (at least) one of the following:

1. Add an optional TODO_COMMENT? in front of every parser rule.
2. Put the TODO_COMMENT token on a separate channel, e.g. ToDoCommentChannel (don't forget to define the constant for this channel!), and select each construct that follows a comment while walking the tree.

Rough outline of what I would do:

1. Use a separate channel for the TODO_COMMENTs.
2. Lex and parse as usual.
3. Get all tokens from the token stream, find those on the desired channel, and for each one take the next token on the default channel; store these in a list.
4. Walk the parse tree and check for each entered rule whether its starting token is in the list. If yes, copy the rule's text to your result list; otherwise recurse (if TODO_COMMENTs can be nested, recurse even when the starting token is in the list).
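
The last two steps of the outline can be sketched in plain Java. This is only an illustration under stated assumptions: `Tok` and `Node` are hypothetical stand-ins for ANTLR's tokens and parse-tree nodes (so the snippet runs without the ANTLR runtime or a generated parser), the channel value 42 is assumed, and the token list and tree in `demo()` are hand-built to mirror part of the sample method body.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TodoPairing {
    static final int DEFAULT_CHANNEL = 0;
    static final int TODO_CHANNEL = 42; // assumed channel value

    /** Hypothetical stand-in for a lexer token: its text and its channel. */
    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    /** Hypothetical stand-in for a parse-tree node: start token index, source text, children. */
    static final class Node {
        final int start;
        final String text;
        final List<Node> children;
        Node(int start, String text, Node... children) {
            this.start = start;
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Maps the index of the first default-channel token after each TODO comment to the comment label. */
    static Map<Integer, String> todoTargets(List<Tok> tokens) {
        Map<Integer, String> targets = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).channel != TODO_CHANNEL) continue;
            String label = tokens.get(i).text.substring("// TODO".length()).trim();
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).channel == DEFAULT_CHANNEL) {
                    targets.put(j, label);
                    break;
                }
            }
        }
        return targets;
    }

    /**
     * Walks the tree top-down. The outermost node that starts at a target token
     * consumes that target (it is removed from the map); children are always
     * visited, so TODO comments nested inside a matched node are still found.
     */
    static void collect(Node node, Map<Integer, String> targets, Map<String, String> out) {
        String label = targets.remove(node.start);
        if (label != null) out.put(label, node.text);
        for (Node child : node.children) collect(child, targets, out);
    }

    /** Hand-built model of: { // TODO atomic \n int a; // TODO revision required \n {} } */
    static Map<String, String> demo() {
        List<Tok> tokens = Arrays.asList(
            new Tok("{", DEFAULT_CHANNEL),                      // index 0
            new Tok("// TODO atomic", TODO_CHANNEL),            // index 1
            new Tok("int", DEFAULT_CHANNEL),                    // index 2
            new Tok("a", DEFAULT_CHANNEL),                      // index 3
            new Tok(";", DEFAULT_CHANNEL),                      // index 4
            new Tok("// TODO revision required", TODO_CHANNEL), // index 5
            new Tok("{", DEFAULT_CHANNEL),                      // index 6
            new Tok("}", DEFAULT_CHANNEL),                      // index 7
            new Tok("}", DEFAULT_CHANNEL));                     // index 8
        Node tree = new Node(0, "{int a;{}}",
            new Node(2, "int a;", new Node(2, "int"), new Node(3, "a")),
            new Node(6, "{}"));
        Map<String, String> result = new LinkedHashMap<>();
        collect(tree, todoTargets(tokens), result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the real runtime, the same idea would presumably use `CommonTokenStream.getTokens()` with `Token.getChannel()` and `Token.getTokenIndex()` for the first method, and `ParserRuleContext.getStart()` inside a listener's `enterEveryRule` for the second. In a full Java grammar several nested rules can start at the same token, so you may want to accept only statement-like contexts rather than simply the outermost one.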

UPDATE:

About the "rule 'LINE_COMMENT' contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output" error:

This can be ignored, since it only affects lexer interpreters such as the one used by Antlrworks2 or the plugin. You can also do it like this:

//Instead of
TODO_COMMENT
    :   '// TODO' ~[\r\n]*  -> channel(ToDoCommentChannel)
    ;    

// do this (assuming the channel value is indeed 42):
TODO_COMMENT
    :   '// TODO' ~[\r\n]*  -> channel(42 /*ToDoCommentChannel*/)
    ;

This will work both in Antlrworks2 and in the generated code (you can still use the named constant for the channel in your Java runtime code).

Channels work fine, but I intended to work with TODOs as AST nodes. However, I consider this question as closed, thanks everyone for help :) — petrbel, Jun 23 '14 at 15:51
If you want them as nodes (I really don't see why you should), use option 1 and put them in front of every location where they could occur. — Onur, Jun 23 '14 at 16:23
