ANTLR proper ordering of grammar rules

Question

I am trying to write a grammar that will recognize <<word>> as a special token but treat <word> as just a regular literal.

Here is my grammar:

grammar test;

doc: item+ ;
item: func | atom ;

func: '<<' WORD '>>' ;
atom: PUNCT+            #punctAtom
    | NEWLINE+          #newlineAtom
    | WORD              #wordAtom
    ;

WS : [ \t] -> skip ;
NEWLINE : [\n\r]+ ;
PUNCT : [.,?!]+ ;
WORD : CHAR+ ;

fragment CHAR : (LETTER | DIGIT | SYMB | PUNCT) ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT : [0-9] ;
fragment SYMB : ~[a-zA-Z0-9.,?! |{}\n\r\t] ;

So something like <<word>> will be matched by two rules, both func and atom. I want it to be recognized as a func, so I put the func rule first.

When I test my grammar with <word> it treats it as an atom, as expected. However when I test my grammar and give it <<word>> it treats it as an atom as well.

Is there something I'm missing?

PS - I have separated atom into PUNCT, NEWLINE, and WORD and given them labels #punctAtom, #newlineAtom, and #wordAtom because I want to treat each of those differently when I traverse the parse tree. Also, a WORD can contain PUNCT because, for instance, someone can write "Hello," and I want to treat that as a single word (for simplicity later on).

PPS - One thing I've tried is I've included < and > in the last rule, which is a list of symbols that I'm "disallowing" to exist inside a WORD. This solves one problem, in that <<word>> is now recognized as a func, but it creates a new problem because <word> is no longer accepted as an atom.

score 2 · Accepted Answer · answered Apr 12 '18 at 18:31

ANTLR's lexer tries to match as many characters as possible. Because SYMB allows < and >, both <<word>> and <word> are matched in their entirety by the lexer rule WORD. Therefore, in these cases the tokens << and >> (or < and >, for that matter) will never be created, and the func rule cannot match.

You can see what tokens are being created by running these lines of code:

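// testLexer is the lexer class ANTLR generates from `grammar test;`
Lexer lexer = new testLexer(CharStreams.fromString("<word> <<word>>"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill(); // read every token up to and including EOF

// Print each token's symbolic name next to the text it matched
for (Token t : tokens.getTokens()) {
  System.out.printf("%-20s %s\n", testLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}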

which will print:

WORD                 <word>
WORD                 <<word>>
EOF                  <EOF>

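What you could do is exclude < and > from SYMB, so that they are tokenized separately, and then let a parser rule accept '<' WORD '>' as a word: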

func
 : '<<' WORD '>>' 
 ;

atom
 : PUNCT+   #punctAtom
 | NEWLINE+ #newlineAtom
 | word     #wordAtom
 ;

word
 : WORD
 | '<' WORD '>'
 ;

...

fragment SYMB : ~[<>a-zA-Z0-9.,?! |{}\n\r\t] ;

Of course, with this change something like foo<bar is no longer a single WORD token, as it was before: it is now tokenized as WORD, '<', WORD.
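
To make the longest-match behaviour concrete, here is a rough, self-contained sketch that imitates it with plain java.util.regex instead of ANTLR. It is not how ANTLR is implemented. The class MaximalMunchDemo, the method tokenize and the flag excludeAngleBrackets are names made up for this illustration, and the regexes are only my approximation of the lexer rules in the question. It assumes that the longest match wins and that, on a tie, the rule defined first wins, with the literals from parser rules placed ahead of the named lexer rules.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchDemo {

    // Hypothetical helper: tokenizes input with a toy longest-match lexer.
    // excludeAngleBrackets = false approximates the original grammar,
    // excludeAngleBrackets = true approximates the grammar with < and > removed from SYMB.
    static List<String> tokenize(String input, boolean excludeAngleBrackets) {
        // Insertion order doubles as rule priority for equal-length matches.
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("'<<'", Pattern.compile("<<"));
        rules.put("'>>'", Pattern.compile(">>"));
        if (excludeAngleBrackets) {
            // Literals introduced by the extra parser rule `word`.
            rules.put("'<'", Pattern.compile("<"));
            rules.put("'>'", Pattern.compile(">"));
        }
        rules.put("WS", Pattern.compile("[ \\t]"));
        rules.put("NEWLINE", Pattern.compile("[\\n\\r]+"));
        rules.put("PUNCT", Pattern.compile("[.,?!]+"));
        // CHAR = LETTER | DIGIT | SYMB | PUNCT, i.e. anything outside the excluded set.
        rules.put("WORD", Pattern.compile(
                excludeAngleBrackets ? "[^<> |{}\\n\\r\\t]+" : "[^ |{}\\n\\r\\t]+"));

        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Strictly longer matches replace the current best, so earlier rules win ties.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestName.equals("WS")) { // WS is skipped, as in the grammar
                tokens.add(bestName + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Original rules: WORD swallows the angle brackets.
        System.out.println(tokenize("<word> <<word>>", false));
        // With < and > excluded from WORD: the brackets become separate tokens.
        System.out.println(tokenize("<word> <<word>>", true));
        System.out.println(tokenize("foo<bar", true));
    }
}
```

With the original rules, WORD produces the longest match at the start of <<word>> (8 characters against 2 for '<<'), so '<<' never gets a chance. Once < and > are excluded from WORD, '<<' is the only rule that can match there, and the parser sees '<<' WORD '>>'.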
