Why doesn't this ANTLR grammar derive the string `baba`?

Question

Using ANTLR v4.9.3, I created the following grammar …

grammar G ;
start : s EOF ;
s : 'ba' a b ;
a : 'b' ;
b : 'a' ;

Given the above grammar, I thought that the following derivation is possible …

start → s → 'ba' a b → 'ba' 'b' b → 'ba' 'b' 'a' = 'baba'

However, my Java test program indicates a syntax error occurs when trying to parse the string baba.

Shouldn't the string baba be in the language generated by grammar G ?

It doesn't work because the tokens recognized are { 'ba' 'ba' EOF }. Neither a nor b can derive 'ba'. The lexer is (usually) not sensitive to parser context. When developing your grammar, it's a good idea to print out the tokens when you see something unusual. — kaby76, Apr 06 '22 at 17:29
It appears that the fix is to change the rule `s : 'ba' a b ;` to `s : 'b' 'a' a b ;` so that the string `ba` is not considered to be a token. — user3134725, Apr 06 '22 at 18:46

score 0 · Accepted Answer · answered Apr 07 '22 at 06:39

Although the conclusion/answer is already in the comments, here is an answer that explains it in a bit more detail.

When you define literal tokens inside parser rules (here 'ba', 'b' and 'a'), ANTLR implicitly creates lexer rules for them, so the grammar is effectively:

grammar G ;
start : s EOF ;
s : T__0 a b ;
a : T__1 ;
b : T__2 ;

T__0 : 'ba';
T__1 : 'b';
T__2 : 'a';

Now, when the lexer gets the input "baba", it will create 2 T__0 tokens. The lexer is not driven by whatever the parser is trying to match; it works independently of the parser. The lexer creates tokens following these 2 rules:

1. try to match as many characters as possible for a rule
2. when 2 (or more) lexer rules match the same characters, let the one defined first "win"

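Because of the first rule, the lexer matches 'ba' twice and creates 2 T__0 tokens. The parser then fails, because after the first T__0 it expects a T__1 (rule a) but receives another T__0.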
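
To make the two lexer rules concrete, here is a small Python sketch that imitates them for literal-only tokens. It is a simplified simulation, not ANTLR's actual implementation; the names `tokenize` and `LEXER_RULES` are made up for this illustration.

```python
# Simplified simulation of the two lexer rules; NOT the real ANTLR lexer.
# The implicit token rules, in definition order (literals only).
LEXER_RULES = [("T__0", "ba"), ("T__1", "b"), ("T__2", "a")]


def tokenize(text, rules):
    """Split text into token names: longest match wins, ties go to the first rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, literal in rules:
            # A strictly longer match replaces the current best (rule 1).
            # An equal-length match never does, so the earlier rule wins (rule 2).
            if text.startswith(literal, pos) and len(literal) > best_len:
                best_name, best_len = name, len(literal)
        if best_name is None:
            raise ValueError(f"no lexer rule matches at position {pos}")
        tokens.append(best_name)
        pos += best_len
    tokens.append("EOF")
    return tokens


print(tokenize("baba", LEXER_RULES))  # ['T__0', 'T__0', 'EOF']
```

The parser never gets a chance to ask for a 'b' or an 'a' on its own: by the time it runs, the input has already been cut into two 'ba' tokens.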

As you already mentioned in a comment, removing the 'ba' token (and using 'b' followed by 'a') would resolve the issue:

grammar G ;
start : s EOF ;
s : 'b' 'a' a b ;
a : 'b' ;
b : 'a' ;

which ANTLR effectively treats as the grammar:

grammar G ;
start : s EOF ;
s : T__0 T__1 a b ;
a : T__0 ;
b : T__1 ;

T__0 : 'b';
T__1 : 'a';
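
As a rough check of the fixed grammar, the following Python sketch tokenizes "baba" with only the two single-character token rules and compares the result with the token sequence that `start` expands to. Again, this is a simplified simulation under the assumption of literal-only tokens, not ANTLR itself; `tokenize`, `FIXED_RULES` and `EXPECTED_BY_PARSER` are hypothetical names.

```python
# Simplified simulation; NOT the real ANTLR lexer or parser.
# Implicit token rules of the fixed grammar, in definition order.
FIXED_RULES = [("T__0", "b"), ("T__1", "a")]


def tokenize(text, rules):
    """Split text into token names: longest match wins, ties go to the first rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, literal in rules:
            if text.startswith(literal, pos) and len(literal) > best_len:
                best_name, best_len = name, len(literal)
        if best_name is None:
            raise ValueError(f"no lexer rule matches at position {pos}")
        tokens.append(best_name)
        pos += best_len
    tokens.append("EOF")
    return tokens


# `start : s EOF` with s, a and b inlined: T__0 T__1 T__0 T__1 EOF.
EXPECTED_BY_PARSER = ["T__0", "T__1", "T__0", "T__1", "EOF"]

tokens = tokenize("baba", FIXED_RULES)
print(tokens)                        # ['T__0', 'T__1', 'T__0', 'T__1', 'EOF']
print(tokens == EXPECTED_BY_PARSER)  # True
```

Comparing against a fixed list only works here because this grammar has exactly one possible derivation; a real parser would of course walk the rules instead.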
