I'm trying to define the language of XQuery and XPath in test.g4. The part of the file relevant to my question looks like:

grammar test;

ap: 'doc' '(' '"' FILENAME '"' ')' '/' rp
|   'doc' '(' '"' FILENAME '"' ')' '//' rp
;

rp: ...;

f:  ...;

xq: STRING
|   ...
;

FILENAME    : [a-zA-Z0-9/_]+ '.xml'  ;
STRING      : '"' [a-zA-Z0-9~!@#$%^&*()=+._ -]+ '"';

WS: [ \n\t\r]+ -> skip;

I tried to parse something like doc("movies.xml")//TITLE, but I get this error:

line 1:4 no viable alternative at input 'doc("movies.xml"'

But if I remove the STRING lexer rule, it works fine. Since FILENAME is defined before STRING, I don't understand why doc("movies.xml")//TITLE is not matched with the FILENAME lexer rule. How can I fix this? Thank you!

paranoider

1 Answer

The literal tokens in your parser rules (such as 'doc' and '(') are nothing more than regular tokens, so the lexer ANTLR generates effectively looks like this:

TOKEN_1  : 'doc';
TOKEN_2  : '(';
TOKEN_3  : '"';
TOKEN_4  : ')';
TOKEN_5  : '/'; 
TOKEN_6  : '//';
FILENAME : [a-zA-Z0-9/_]+ '.xml'  ;
STRING   : '"' [a-zA-Z0-9~!@#$%^&*()=+._ -]+ '"';
WS       : [ \n\t\r]+ -> skip;

(they're not really called TOKEN_..., but that's unimportant)

Now, the way ANTLR creates tokens is to try to match as many characters as possible. Whenever two (or more) rules match the same number of characters, the one defined first "wins". Given these rules, the input doc("movies.xml") will be tokenised as follows:

  • doc → TOKEN_1
  • ( → TOKEN_2
  • "movies.xml" → STRING
  • ) → TOKEN_4

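Since ANTLR tries to match as many characters as possible, "movies.xml" is tokenised as a single STRING token: at the opening quote, the '"' literal matches only 1 character and FILENAME cannot start with a quote at all, while STRING matches all 12 characters. The order of the rules only matters when the match lengths are equal, so defining FILENAME before STRING does not help. The parser then expects '"' FILENAME '"' but receives a STRING, which is why it reports "no viable alternative". The lexer does not "listen" to what the parser might need at a given time. This is how ANTLR works, and you cannot change it.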
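
One possible way around this, sketched here rather than taken from the original grammar, is to accept the lexer's behaviour and use the STRING token in ap instead of '"' FILENAME '"'. In the sketch below, the rp rule and the NAME token are hypothetical placeholders (the real rp is elided in the question), and STRING is kept exactly as in the question, so a '/' would have to be added to its character set if file names can contain paths.

```antlr
grammar test;

// The quoted file name is now consumed as one STRING token,
// which is what the lexer produces for "movies.xml" anyway.
ap  : 'doc' '(' STRING ')' '/' rp
    | 'doc' '(' STRING ')' '//' rp
    ;

// Hypothetical placeholder: the real rp rule is elided in the question.
rp  : NAME ;

// Unchanged from the question.
STRING : '"' [a-zA-Z0-9~!@#$%^&*()=+._ -]+ '"' ;

// Hypothetical token for element names such as TITLE.
NAME   : [a-zA-Z_] [a-zA-Z0-9_]* ;

WS     : [ \n\t\r]+ -> skip ;
```

With this version, doc("movies.xml")//TITLE should be tokenised as 'doc', '(', STRING, ')', '//', NAME, which the second alternative of ap accepts. The check that the name ends in .xml no longer happens in the lexer, so it would have to move elsewhere, for example into a listener or visitor that inspects the STRING text.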

FYI, there's a user-contributed XPath grammar here: https://github.com/antlr/grammars-v4/blob/master/xpath/xpath.g4

Bart Kiers