
Using ANTLR v4.9.3, I created the following grammar …

grammar G ;
start : s EOF ;
s : 'ba' a b ;
a : 'b' ;
b : 'a' ;

Given the above grammar, I thought that the following derivation is possible …

start ⇒ s ⇒ 'ba' a b ⇒ 'ba' 'b' b ⇒ 'ba' 'b' 'a' = 'baba'

However, my Java test program indicates a syntax error occurs when trying to parse the string baba.

Shouldn't the string baba be in the language generated by grammar G ?


user3134725
    It doesn't work because the tokens recognized are { 'ba' 'ba' EOF }. Neither a nor b can derive 'ba'. The lexer is (usually) not sensitive to parser context. When developing your grammar, it's a good idea to print out the tokens when you see something unusual. – kaby76 Apr 06 '22 at 17:29
    It appears that the fix is to change the rule `s : 'ba' a b ;` to `s : 'b' 'a' a b ;` so that the string `ba` is not considered to be a token. – user3134725 Apr 06 '22 at 18:46

1 Answer


Although the conclusion/answer is already in the comments, here is an answer that explains it in a bit more detail.

When you define literal tokens inside parser rules (the 'ba', 'b' and 'a'), ANTLR implicitly creates the following grammar:

grammar G ;
start : s EOF ;
s : T__0 a b ;
a : T__1 ;
b : T__2 ;

T__0 : 'ba';
T__1 : 'b';
T__2 : 'a';

Now, when the lexer gets the input "baba", it will create 2 T__0 tokens. The lexer is not driven by whatever the parser is trying to match; it works independently of the parser. The lexer creates tokens following these 2 rules:

  1. try to match as many characters as possible for a rule
  2. when 2 (or more) lexer rules match the same characters, let the one defined first "win"

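Because of rule 1, 'ba' (2 characters) beats 'b' (1 character) at the start of the input and again after the first 'ba', so 2 T__0 tokens are created. The parser rule s then fails, because it expects T__0 followed by T__1 and T__2.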
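To make the two lexer rules concrete, here is a minimal Python sketch that simulates them on the implicit token definitions. It is only an illustration of the longest-match and first-defined-wins behaviour, not ANTLR's actual implementation; the names `RULES` and `tokenize` are made up for this example.

```python
# Implicit lexer rules from the grammar above, in definition order.
# (RULES and tokenize are hypothetical names used only for this sketch.)
RULES = [("T__0", "ba"), ("T__1", "b"), ("T__2", "a")]


def tokenize(text, rules):
    """Simulate: 1) longest match wins, 2) on a tie, the earliest rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None  # (name, literal) of the winning rule at this position
        for name, literal in rules:
            # Only a strictly longer match replaces the current winner,
            # so an earlier rule keeps priority on ties.
            if text.startswith(literal, pos) and (
                best is None or len(literal) > len(best[1])
            ):
                best = (name, literal)
        if best is None:
            raise ValueError(f"no lexer rule matches at position {pos}")
        tokens.append(best[0])
        pos += len(best[1])
    return tokens


print(tokenize("baba", RULES))  # ['T__0', 'T__0']
```

The parser never sees a T__1 or T__2 token for this input, which is why the rule `s : T__0 a b` cannot match.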

As you already mentioned in a comment, removing the 'ba' token (and using 'b' followed by 'a') would resolve the issue:

grammar G ;
start : s EOF ;
s : 'b' 'a' a b ;
a : 'b' ;
b : 'a' ;

which ANTLR would implicitly treat as the grammar:

grammar G ;
start : s EOF ;
s : T__0 T__1 a b ;
a : T__0 ;
b : T__1 ;

T__0 : 'b';
T__1 : 'a';
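As a rough check of why this version works, the following Python sketch tokenizes "baba" with the two remaining implicit rules and compares the result with the token sequence that rule `s` expects once `a` and `b` are expanded by hand. It is a simulation under the two lexer rules described earlier, not ANTLR itself; `FIXED_RULES`, `EXPECTED_FOR_S` and `lex` are hypothetical names.

```python
# Implicit lexer rules of the revised grammar, in definition order.
# (All names here are hypothetical and exist only for this sketch.)
FIXED_RULES = [("T__0", "b"), ("T__1", "a")]

# Parser rule s : T__0 T__1 a b, with a : T__0 and b : T__1 expanded by hand.
EXPECTED_FOR_S = ["T__0", "T__1", "T__0", "T__1"]


def lex(text, rules):
    """Longest match wins; on equal length the earliest-defined rule wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        matches = [(name, lit) for name, lit in rules if text.startswith(lit, pos)]
        if not matches:
            raise ValueError(f"no lexer rule matches at position {pos}")
        # max() returns the first maximal element, so earlier rules win ties.
        name, lit = max(matches, key=lambda m: len(m[1]))
        tokens.append(name)
        pos += len(lit)
    return tokens


tokens = lex("baba", FIXED_RULES)
print(tokens)                    # ['T__0', 'T__1', 'T__0', 'T__1']
print(tokens == EXPECTED_FOR_S)  # True
```

Since no multi-character literal is left, every character becomes its own token and the sequence lines up with what `s` requires.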
Bart Kiers