What is wrong with whitespaces in ANTLR?

Question

I have a really simple XML (HTML) parsing ANTLR grammar:

wiki: ggg+;

ggg: tag | text;

tag: '<' tx=TEXT { System.out.println($tx.getText()); } '>';

text: tx=TEXT { System.out.println($tx.getText()); };

CHAR: ~('<'|'>');
TEXT: CHAR+;

With such input: "<ggg> fff" it works fine.

But as soon as whitespace is involved, it fails. For example:

" <ggg> fff " - fails at the beginning
"<ggg> <hhh> " - fails after <ggg>
"<ggg> fff " - works fine
"<ggg> " - fails at end

I don't know what is wrong. Maybe there is some special grammar option to handle this. ANTLRWorks gives me NoViableAltException.

score 3 · Accepted Answer · answered Jun 24 '12 at 14:04

ANTLR's lexer rules match as much as possible. Only when two or more rules match the same number of characters does the rule defined first "win". Because of that, a single character other than '<' and '>' is tokenized as a CHAR token, and not as a TEXT token, regardless of what the parser "needs" (the lexer operates independently from the parser, remember that!). Only a run of two or more characters other than '<' and '>' is tokenized as a (single) TEXT token.

Therefore the input " <ggg> fff " creates the following 5 tokens:

type    | text
--------+-----------
CHAR    |   ' '
'<'     |   '<'
TEXT    |   'ggg'
'>'     |   '>'
TEXT    |   ' fff '

And since the token CHAR is not accounted for in your parser rule(s), the parse fails.
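
To make the rule concrete, here is a minimal Java sketch that imitates the behaviour described above: at each position every rule is tried, the longest match wins, and on a tie the rule listed first wins. This is not ANTLR's generated lexer; the class name `LongestMatchSketch` and its helper methods are made up for illustration, and the rule order ('<', '>', CHAR, TEXT) is an assumption taken from the grammar in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of "longest match, first rule wins on ties".
final class LongestMatchSketch {

    // Rule order mirrors the question's grammar; CHAR comes before TEXT.
    private static final String[] RULES = {"'<'", "'>'", "CHAR", "TEXT"};

    private static boolean isPlain(char c) {
        return c != '<' && c != '>';
    }

    // Number of characters the given rule matches at pos (0 means no match).
    static int matchLength(int rule, String input, int pos) {
        char c = input.charAt(pos);
        switch (rule) {
            case 0: return c == '<' ? 1 : 0;
            case 1: return c == '>' ? 1 : 0;
            case 2: return isPlain(c) ? 1 : 0;   // CHAR: ~('<'|'>')
            default:                              // TEXT: CHAR+
                int end = pos;
                while (end < input.length() && isPlain(input.charAt(end))) {
                    end++;
                }
                return end - pos;
        }
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int best = 0;
            int bestLen = 0;
            for (int r = 0; r < RULES.length; r++) {
                int len = matchLength(r, input, pos);
                // Strictly greater: on equal length the earlier rule is kept.
                if (len > bestLen) {
                    best = r;
                    bestLen = len;
                }
            }
            tokens.add(RULES[best] + ":\"" + input.substring(pos, pos + bestLen) + "\"");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(" <ggg> fff "));
    }
}
```

For " <ggg> fff " the leading space is followed directly by '<', so CHAR and TEXT both match exactly one character and CHAR, being listed first, is chosen. The trailing " fff " is five characters long, so TEXT matches more than CHAR and wins there.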

Simply remove CHAR and do:

TEXT : ~('<'|'>')+;
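
As a rough check of that single-rule lexer, the Java sketch below approximates it with a regular expression and lists the token types for the four inputs from the question. It is only an approximation under the assumption that the lexer has exactly three token types ('<', '>' and TEXT); the class name `SingleTextRuleSketch` is hypothetical and none of this is ANTLR-generated code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical approximation of a lexer with only '<', '>' and TEXT.
final class SingleTextRuleSketch {

    // Group 1: '<', group 2: '>', group 3: TEXT, i.e. ~('<'|'>')+
    private static final Pattern TOKEN = Pattern.compile("(<)|(>)|([^<>]+)");

    static List<String> tokenTypes(String input) {
        List<String> types = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                types.add("'<'");
            } else if (m.group(2) != null) {
                types.add("'>'");
            } else {
                types.add("TEXT");
            }
        }
        return types;
    }

    public static void main(String[] args) {
        String[] inputs = {" <ggg> fff ", "<ggg> <hhh> ", "<ggg> fff ", "<ggg> "};
        for (String s : inputs) {
            System.out.println("\"" + s + "\" -> " + tokenTypes(s));
        }
    }
}
```

With no CHAR rule left, a lone space between or around tags can only become a TEXT token, which the `text` parser rule accepts.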

score 1 · Answer 2 · Elliot Chance · 2012-06-24T13:27:16.340

You have no token to deal with the space. A space for a lexer is no different from any other character it may encounter.

If whitespace is unimportant you can simply use:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+    { $channel = HIDDEN; } ;

If whitespace is important to you:

WHITESPACE : ( '\t' | ' ' | '\r' | '\n'| '\u000C' )+ ;
CHAR: ~('<'|'>');
TEXT: (CHAR|WHITESPACE)+;

edited Jun 24 '12 at 13:27

answered Jun 24 '12 at 13:20


Whitespace is important to me. You wrote that a space is no different from any other character for a lexer. But in my example the CHAR token should match any whitespace character, so it should work, and it does not. Conclusion: whitespace is different from other characters for a lexer! – pablo Jun 24 '12 at 13:24
And what you gave is wrong, because there are multiple alternatives (WHITESPACE and CHAR). – pablo Jun 24 '12 at 13:34
@pablo, you are right in the fact that your lexer rule `CHAR` accounts for space-chars, but your conclusion is wrong. I'll explain shortly in an answer. – Bart Kiers Jun 24 '12 at 13:59
