Antlr3 matching tokens without whitespace

Question

Given the input "term >1", the number (1) and the comparison operator (>) should generate separate nodes in an AST. How can this be achieved?

In my tests, matching only occurred if the operator and "1" were separated by a space, like so: "term < 1".

Current grammar:

startExpression  : orEx;

expressionLevel4    
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression;
expressionLevel3    
: (fieldExpression) | expressionLevel4 ;
expressionLevel2    
: (nearExpression) | expressionLevel3 ;
expressionLevel1    
: (countExpression) | expressionLevel2 ;
notEx   : (NOT^)? expressionLevel1;
andEx   : (notEx        -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;
orEx    : andEx (OR^  andEx)*;

 countExpression  : COUNT LPARENTHESIS WORD RPARENTHESIS RELATION NUMBERS -> ^(COUNT WORD RELATION NUMBERS);

nearExpression  : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));

fieldExpression : WORD PROPERTYSEPERATOR WORD -> ^(FIELDSEARCH ^(TARGETFIELD WORD) WORD );

atomicExpression 
: WORD
| PHRASE
;

fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE     : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE     : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');
fragment UNICODENOSPACES:  ('\u0021'..'\u0027'|'\u0030'..'\u0039'|'\u003B'..'\u007E'|'\u00A1'..'\uFFFF');
//fragment UNICODENOSPACES  :  ('\u0021'..'\u0039'|'\u003B'..'\u007E'|'\u00A1'..'\uFFFF');

LPARENTHESIS : '(';
RPARENTHESIS : ')';

AND    : ('A'|'a')('N'|'n')('D'|'d');
OR     : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT    : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';

WS     : (SPACE) { $channel=HIDDEN; };
RELATION : LESSTHEN? MORETHEN? EQUAL?;
NUMBERS : (NUMBER)+;
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
WORD   : (UNICODENOSPACES)+;

Bart Kiers · Accepted Answer · 2012-12-03T14:02:38.343

That is because your WORD rule matches too much: it also matches ">" so when ">1" are written together, these 2 chars are tokenized as a single WORD-token.

Whenever I'm unsure what my lexer is doing, I simply let the parser match zero or more tokens of any type, and print the type and text of all tokens:

parse
 : (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
 ;

When you let the rule above match your input "term > 1", the following gets printed:

WORD            'term'
RELATION        '>'
WORD            '1'

and for the input "term >1":

WORD            'term'
WORD            '>1'

There's no way around this: when the lexer can match 2 (or more) characters (the WORD rule), it will choose that path over a rule defined before it which will only match a single char (the RELATION rule).

Also note that your RELATION rule:

RELATION : LESSTHEN? MORETHEN? EQUAL?;

potentially matches the empty string. Make sure every lexer rule matches at least 1 character, otherwise your lexer might get into an infinite loop.

Better do something like this:

RELATION
 : (LESSTHEN | MORETHEN)? EQUAL // '<=', '>=', or '='
 | (LESSTHEN | MORETHEN)        // '<' or '>'
 ;

Thanks! *sigh* So again WORD has to be adjusted. Seeing ANTLRWorks parse the term step by step and then fail to find a RELATION (which is there from my point of view) really goes against how I pictured the process in my mind. Is this (excluding every such char from WORD) really how it's done properly? WORD will then never match terms that contain the excluded characters, because those characters are used elsewhere. — Th 00 mÄ s, Dec 03 '12 at 14:07
@ThomAS, I don't know your requirements exactly, but if you can, let a `WORD` at the very least start with a letter (not a `>` or `(` etc.). That should fix this issue (and other issues you're having too, maybe). — Bart Kiers, Dec 03 '12 at 14:11
Having every matcher rule match at least 1 character and protecting the start of the WORD are two very good tips! Thanks for (again) helping me out! — Th 00 mÄ s, Dec 03 '12 at 14:23
