ANTLR and keywords/tokens

Question

I am trying to write a simple grammar in ANTLR4 for use in my project, but I cannot wrap my head around one problem. Say I have the following grammar:

parser grammar GrammarParser;
ranks : RANKS COLON entry* ;
entry : rank who SEMICOLON ;
rank  : RANK_HIGH | RANK_LOW ;
who   : ID ;

lexer grammar GrammarLexer;
RANKS      : 'ranks' ;
RANK_HIGH  : 'high' ;
RANK_LOW   : 'low' ;
ID         : [a-zA-Z]+ ;
COLON      : ':' ;
SEMICOLON  : ';' ;
WS         : [ \t\r\n]+ -> skip ;

The problem is that, with these grammars, this simple input breaks the whole idea:

ranks: low ranks; high high; ranks ranks;

First of all, the lexer returns the following token stream for it:

RANKS COLON ID RANKS SEMICOLON ID ID SEMICOLON RANKS RANKS SEMICOLON

which shows the problem. RANKS should be a keyword only in that starting position; instead, the lexer also emits RANKS in places where the rules expect an ID (the first and third entries). Because RANKS is defined before every other rule, the lexer prefers it over RANK_HIGH/RANK_LOW/ID whenever it and one of those rules match the same character sequence from the stream. The same happens with ID over RANK_HIGH/RANK_LOW.

So, all in all, I can write 'ranks' anywhere, but it will always be tokenized as RANKS, and I cannot use 'high'/'low', because they will always be recognized as ID. ID can never be 'ranks' either, again because of rule priority.

Modes do not seem helpful here either: nothing in the grammar marks where the ranks section really ends, so the lexer has no token on which to pop the mode (and the section might be just a small part of the whole file being parsed).

Is there any solution for that?

score 1 · Answer 1 · answered May 07 '14 at 00:01

Move

ID         : [a-zA-Z]+ ;

after the keyword rules :)

Terence Parr

Yes, that obviously fixes RANK_HIGH/RANK_LOW being recognized as ID where they should not be. But it does not fix the case where you put, for instance, 'ranks' where an ID should be: 'ranks' is always recognized as RANKS, so the rule for `entry` is not matched (RANKS != ID). Moreover, you now cannot use 'high'/'low' where ID appears in the rule, because again RANK_* != ID, so entries such as "high high" are rejected. – user767849 May 07 '14 at 00:17
That comment works if by that you meant "move the rule for ID to the very end of the grammar above". Thanks for the comment though. I think I am not horribly wrong, but if so, I would love to hear it. :) – user767849 May 07 '14 at 00:24
There is no need to move ID to the end of the grammar, just to after the keywords. I do not see a whitespace rule, so that grammar cannot be producing the results you indicate. Please correct that and I will take another look. – Terence Parr May 07 '14 at 21:14
Edited the grammar by adding the WS rule. Nothing special. The problem is that lexers do not know the context of the input they tokenize, so for instance 'ranks' will always be RANKS. You can control that with modes, but modes are opened and closed (pushed and popped) by lexer rules, so a mode that has an opening token but no closing one can never be popped (here RANKS is the opening token of the section, but there is no clear end indicator). – user767849 May 07 '14 at 21:55
Still, RANK_HIGH : 'high' ; RANK_LOW : 'low' ; will never be matched. Please fix that too. – Terence Parr May 09 '14 at 16:41
Done. Not sure how it affects the overall problem. I wrote the grammar on the fly as an example of the problem I encountered. No matter, then: I fixed it on my own in a weird way. Thanks. – user767849 May 10 '14 at 00:22
To get my keywords to work, I had to do this and the accepted answer to [this similar question](https://stackoverflow.com/questions/41421644/antlr4-how-to-build-a-grammar-allowed-keywords-as-identifier) – Corwin Newall Nov 24 '17 at 09:30
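
For readers who land here with the same problem: one common way to let keywords double as identifiers is to keep the lexer context-free and instead widen the parser rule that expects a name, so that it also accepts the keyword tokens. The sketch below is only an illustration built on the grammar from the question; merging it into a single combined grammar and calling it `Ranks` are assumptions made to keep the example self-contained, and this is not necessarily what the asker ended up doing.

```antlr
// Hypothetical combined grammar (the name "Ranks" is an assumption).
grammar Ranks;

ranks : RANKS COLON entry* ;
entry : rank who SEMICOLON ;
rank  : RANK_HIGH | RANK_LOW ;

// The only change to the parser: a name may be a plain ID or any
// keyword token, so 'ranks', 'high' and 'low' are legal names too.
who   : ID | RANKS | RANK_HIGH | RANK_LOW ;

// Keywords come before ID: when two rules match the same text,
// ANTLR picks the rule that is defined first.
RANKS      : 'ranks' ;
RANK_HIGH  : 'high' ;
RANK_LOW   : 'low' ;
ID         : [a-zA-Z]+ ;
COLON      : ':' ;
SEMICOLON  : ';' ;
WS         : [ \t\r\n]+ -> skip ;
```

With this layout the lexer still tokenizes `ranks: low ranks; high high;` as RANKS COLON RANK_LOW RANKS SEMICOLON RANK_HIGH RANK_HIGH SEMICOLON, but the parser now accepts both entries, because the second token of each entry is matched by `who` rather than by ID alone. The trade-off is that every new keyword has to be added to `who` by hand.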
