
I have a really simple ANTLR grammar for parsing XML (HTML):

wiki: ggg+;

ggg: tag | text;

tag: '<' tx=TEXT { System.out.println($tx.getText()); } '>';

text: tx=TEXT { System.out.println($tx.getText()); };

CHAR: ~('<'|'>');
TEXT: CHAR+;

With input such as "<ggg> fff" it works fine.

But as soon as whitespace shows up in certain positions, it fails. For example:

  • " <ggg> fff " - fails at the beginning
  • "<ggg> <hhh> " - fails after <ggg>
  • "<ggg> fff " - works fine
  • "<ggg> " - fails at end

I don't know what is wrong. Maybe there is some special grammar option to handle this. ANTLRWorks gives me a NoViableAltException.

pablo

2 Answers


ANTLR's lexer rules match as much as possible. Only when two or more rules match the same number of characters does the rule defined first "win". Because of that, a single character other than '<' and '>' is tokenized as a CHAR token, not as a TEXT token, regardless of what the parser "needs" (remember that the lexer operates independently of the parser!). Only a run of two or more characters other than '<' and '>' is tokenized as a (single) TEXT token.

Therefore, the input " <ggg> fff " produces the following 5 tokens:

type    | text
--------+-----------
CHAR    |   ' '
'<'     |   '<'
TEXT    |   'ggg'
'>'     |   '>'
TEXT    |   ' fff '

And since the token CHAR is not accounted for in your parser rule(s), the parse fails.
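The same reasoning can be checked against the other inputs from the question. The following is a minimal sketch in plain Java that only mimics the rule described above (longest match, with a tie going to the rule defined first). It is not ANTLR's actual implementation, and the class `LexerSketch` and its `tokenize` method are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: mimics the tokenization described above.
class LexerSketch {

    // Returns each token as TYPE('text'); the literals '<' and '>' are returned as-is.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '<' || c == '>') {
                // Literal tokens used by the parser rule `tag`.
                tokens.add("'" + c + "'");
                i++;
                continue;
            }
            // Consume the longest run of characters other than '<' and '>'.
            int start = i;
            while (i < input.length() && input.charAt(i) != '<' && input.charAt(i) != '>') {
                i++;
            }
            String text = input.substring(start, i);
            // A run of length 1 is matched equally well by CHAR and TEXT;
            // CHAR is defined first in the grammar, so it wins the tie.
            String type = text.length() == 1 ? "CHAR" : "TEXT";
            tokens.add(type + "('" + text + "')");
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {" <ggg> fff ", "<ggg> <hhh> ", "<ggg> fff ", "<ggg> "};
        for (String in : inputs) {
            System.out.println("\"" + in + "\" -> " + tokenize(in));
        }
    }
}
```

In this sketch, a CHAR token shows up in exactly the inputs that fail, and at the position where each failure was reported: a lone space at the start, between two tags, or at the end. The input "<ggg> fff " produces no CHAR token, because its only whitespace is part of the longer run ' fff '.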

Simply remove CHAR and do:

TEXT : ~('<'|'>')+;
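
As a rough cross-check of the single-rule version, here is a small Java sketch that approximates the resulting token stream with a regular expression. It assumes the only tokens left are the literals '<' and '>' plus TEXT. The class `SingleTextRuleSketch` is a hypothetical name, and the regex stands in for the lexer; it is not ANTLR itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: approximates the lexer once CHAR is removed.
class SingleTextRuleSketch {

    // Either one of the literals '<' / '>', or the longest run of other characters (TEXT).
    private static final Pattern TOKEN = Pattern.compile("[<>]|[^<>]+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            String text = m.group();
            boolean literal = text.equals("<") || text.equals(">");
            tokens.add(literal ? "'" + text + "'" : "TEXT('" + text + "')");
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String in : new String[] {" <ggg> fff ", "<ggg> <hhh> ", "<ggg> "}) {
            System.out.println("\"" + in + "\" -> " + tokenize(in));
        }
    }
}
```

With only one rule for non-bracket characters, a lone space becomes a TEXT token like any other run, which the parser rule `text` accepts.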
Bart Kiers

You have no token to deal with the space. A space for a lexer is no different from any other character it may encounter.

If whitespace is unimportant you can simply use:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+    { $channel = HIDDEN; } ;

If whitespace is important to you:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+;
CHAR: ~('<'|'>');
TEXT: (CHAR|WHITESPACE)+;
Elliot Chance
  • Whitespace is important to me. You wrote that a space is no different from any other character to a lexer. But in my example the CHAR token should match any whitespace character, so it should work, yet it does not. Conclusion: whitespace is different from other characters to a lexer! – pablo Jun 24 '12 at 13:24
  • And the grammar you gave is wrong because there are multiple alternatives (WHITESPACE and CHAR). – pablo Jun 24 '12 at 13:34
  • @pablo, you are right that your lexer rule `CHAR` accounts for space characters, but your conclusion is wrong. I'll explain shortly in an answer. – Bart Kiers Jun 24 '12 at 13:59