I have defined the following grammar:

grammar Test;

parse: expr EOF;

expr :  IF comparator FROM field THEN                                                                   #comparatorExpr
;

dateTime        :   DATE_TIME;
number          :   (INT|DECIMAL);
field           :   FIELD_IDENTIFIER;
op              :   (GT | GE | LT | LE | EQ);
comparator      :   op (number|dateTime);

fragment LETTER : [a-zA-Z];
fragment DIGIT  : [0-9];

IF                   : '$IF';
FROM                 : '$FROM';
THEN                 : '$THEN';
OR                   : '$OR';
GT                   : '>' ;
GE                   : '>=' ;
LT                   : '<' ;
LE                   : '<=' ;
EQ                   : '=' ;
INT                  : DIGIT+;
DECIMAL              : INT'.'INT;
DATE_TIME            : (INT|DECIMAL)('M'|'y'|'d');
FIELD_IDENTIFIER     : (LETTER|DIGIT)(LETTER|DIGIT|' ')*;
WS                   : [ \r\t\u000C\n]+ -> skip;

And I try to parse the following input:

$IF >=15 $FROM AgeInYears $THEN

it gives me the following error:

line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}

All the SO posts I found point to the same cause for this error: identical lexer rules. But I cannot see how 15 could be matched by either DECIMAL, which requires a . between two INTs, or DATE_TIME, which requires an M, y, or d suffix.

Any pointers would be appreciated here.

avs099

1 Answer

It's always a good idea to take a look at the token stream that your lexer produces:

 grun Test parse -tokens -tree Test.txt
[@0,0:2='$IF',<'$IF'>,1:0]
[@1,4:5='>=',<'>='>,1:4]
[@2,6:8='15 ',<FIELD_IDENTIFIER>,1:6]
[@3,9:13='$FROM',<'$FROM'>,1:9]
[@4,15:25='AgeInYears ',<FIELD_IDENTIFIER>,1:15]
[@5,26:30='$THEN',<'$THEN'>,1:26]
[@6,31:30='<EOF>',<EOF>,1:31]
line 1:6 mismatched input '15 ' expecting {INT, DECIMAL, DATE_TIME}
(parse (expr $IF (comparator (op >=) 15 ) $FROM (field AgeInYears ) $THEN) <EOF>)

Here we see that "15 " (1 5 space) has been matched by the FIELD_IDENTIFIER rule. The ANTLR lexer picks whichever rule matches the most input characters (and, on a tie, the rule defined first in the grammar). FIELD_IDENTIFIER matches three characters here, so it wins over INT, which only matches two.
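To make the longest-match behaviour concrete, here is a minimal Python sketch that imitates it. This is not ANTLR itself: the lexer rules from the question are hand-transcribed into regular expressions (an approximation I am assuming is faithful for this input), and `tokenize` is a hypothetical helper name.

```python
import re

# Lexer rules from the question, in grammar order, transcribed to regexes.
# Order matters: on a tie in match length, the earlier rule wins.
RULES = [(name, re.compile(pattern)) for name, pattern in [
    ("IF", r"\$IF"),
    ("FROM", r"\$FROM"),
    ("THEN", r"\$THEN"),
    ("OR", r"\$OR"),
    ("GT", r">"),
    ("GE", r">="),
    ("LT", r"<"),
    ("LE", r"<="),
    ("EQ", r"="),
    ("INT", r"[0-9]+"),
    ("DECIMAL", r"[0-9]+\.[0-9]+"),
    ("DATE_TIME", r"[0-9]+(?:\.[0-9]+)?[Myd]"),
    ("FIELD_IDENTIFIER", r"[a-zA-Z0-9][a-zA-Z0-9 ]*"),
    ("WS", r"[ \r\t\f\n]+"),
]]


def tokenize(text):
    """Return (rule name, text) pairs using longest-match, first-rule-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer only, so earlier rules keep winning ties.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


for token in tokenize("$IF >=15 $FROM AgeInYears $THEN"):
    print(token)
```

The third token printed is `('FIELD_IDENTIFIER', '15 ')`: at that position INT can only consume `15`, while FIELD_IDENTIFIER also consumes the trailing space, so the longer match wins.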

For this particular input, you can solve this by reworking the FIELD_IDENTIFIER rule to be:

FIELD_IDENTIFIER: (LETTER | DIGIT)+ (' '+ (LETTER | DIGIT)+)*;

grun Test parse -tokens -tree Test.txt
[@0,0:2='$IF',<'$IF'>,1:0]
[@1,4:5='>=',<'>='>,1:4]
[@2,6:7='15',<INT>,1:6]
[@3,9:13='$FROM',<'$FROM'>,1:9]
[@4,15:24='AgeInYears',<FIELD_IDENTIFIER>,1:15]
[@5,26:30='$THEN',<'$THEN'>,1:26]
[@6,31:30='<EOF>',<EOF>,1:31]
(parse (expr $IF (comparator (op >=) (number 15)) $FROM (field AgeInYears) $THEN) <EOF>)

That said, I suspect that allowing spaces within your FIELD_IDENTIFIER (without some sort of start/stop markers) is likely to be a continuing source of pain as you work on this. There's a reason why you don't see this in most languages, and it's not that nobody thought it would be handy to allow multi-word identifiers: it requires a greedy lexer rule that is likely to take precedence over other rules, as it did here.
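As a rough illustration of that remaining risk, the Python sketch below imitates longest-match lexing with a space-tolerant identifier rule. It is not ANTLR: the rules are hand-transcribed regexes, `tokenize` is a hypothetical helper, and FIELD_IDENTIFIER is assumed to be "words of letters or digits separated by spaces", written as `[a-zA-Z0-9]+(?: +[a-zA-Z0-9]+)*`. The second input is made up purely to show the failure mode.

```python
import re

# Rules in grammar order; on a tie in match length the earlier rule wins.
# FIELD_IDENTIFIER here is the assumed "words separated by spaces" form.
RULES = [(name, re.compile(pattern)) for name, pattern in [
    ("IF", r"\$IF"),
    ("FROM", r"\$FROM"),
    ("THEN", r"\$THEN"),
    ("OR", r"\$OR"),
    ("GT", r">"),
    ("GE", r">="),
    ("LT", r"<"),
    ("LE", r"<="),
    ("EQ", r"="),
    ("INT", r"[0-9]+"),
    ("DECIMAL", r"[0-9]+\.[0-9]+"),
    ("DATE_TIME", r"[0-9]+(?:\.[0-9]+)?[Myd]"),
    ("FIELD_IDENTIFIER", r"[a-zA-Z0-9]+(?: +[a-zA-Z0-9]+)*"),
    ("WS", r"[ \r\t\f\n]+"),
]]


def tokenize(text):
    """Return (rule name, text) pairs using longest-match, first-rule-wins."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer only, so earlier rules keep winning ties.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "WS":  # WS is skipped, as in the grammar
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# The input from the question: "15" ties between INT and FIELD_IDENTIFIER
# (two characters each), and INT wins because it is defined first.
print(tokenize("$IF >=15 $FROM AgeInYears $THEN")[2])  # ('INT', '15')

# Hypothetical input: a number followed directly by a name is swallowed
# into a single identifier, because that match is longer than INT's.
print(tokenize("15 AgeInYears"))  # [('FIELD_IDENTIFIER', '15 AgeInYears')]
```

The fix works for the original input only because `$FROM` happens to follow the number. As soon as a digit sequence is followed by a space and another word, the identifier rule takes over again, which is the kind of ongoing trouble that explicit delimiters around multi-word identifiers would avoid.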

Bart Kiers
Mike Cargal