How do I match Unicode characters in ANTLR?

Question

I am trying to pick out all tokens in a text and need to match all ASCII and Unicode characters, so here is how I have laid out the fragments:

fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
fragment UNICODE  :  '\u0000'..'\u00FF';

Now if I write my token rule as:

TOKEN  :  (CHAR|DIGIT|UNICODE)+;

I get these warnings:

"Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input"

"Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input"

And nothing gets matched. The same happens if I write it as:

TOKEN  :  (UNICODE)+;

Nothing gets matched.

Is there a way of doing this?

'\u0000'..'\u00FF' does not cover "all Unicode characters", it only covers the first 256. — Michael Madsen, Jan 17 '10 at 17:24
True, but I thought Java doesn't support five-digit Unicode yet. — Lezan, Jan 17 '10 at 19:24
With \u00FF, we're not in 5 digit Unicode country yet; that's only 2 so far. There's still all the characters from \u0100 to about \uF8FF. — Carl Smotricz, Jan 17 '10 at 19:43
Actually, since the subset \u0000..\u00FF is an exact duplicate of ISO-8859-1, you could argue we haven't got any Unicode at all in there. :o) (And for the record: The highest valid codepoint in the BMP is \uFFFD, the replacement character, but not all codepoints up to that value are assigned. \uFFFE and \uFFFF are not characters.) — Michael Madsen, Jan 17 '10 at 19:56
Ah, sorry, I just realised I meant \uFFFFF without the 0, but yes, that is wrong as well since it doesn't have a value. — Lezan, Jan 17 '10 at 22:31

score 7 · Accepted Answer · answered Jan 18 '10 at 21:06

One other thing to consider if you are planning on using Unicode: you should set the charVocabulary option to allow any character in the Unicode range 0 through FFFE.

options
{
charVocabulary='\u0000'..'\uFFFE';
}

The default you'll usually see in the examples is:

options
{
charVocabulary = '\3'..'\377';
}

To cover the point made above: generally, if you need both the ASCII character range 'A'..'Z' and the Unicode range, you'd make a Unicode lexer rule that starts above ASCII, such as '\u0080'..'\ufffe', so that the two ranges do not overlap.
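
A minimal sketch of that idea, assuming ANTLR 3 syntax (which the question's fragment rules and warning messages appear to use) and reusing the fragment names from the question. The grammar name and the WS rule are assumptions added for illustration, and the action in WS is Java-target syntax. Because UNICODE starts at \u0080, it no longer overlaps CHAR or DIGIT, which is what triggered the "multiple alternatives" warnings.

```antlr
lexer grammar UnicodeTokens;   // hypothetical name; ANTLR 3 expects the file name to match

fragment CHAR    : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT   : '0'..'9' ;
// Starts above the ASCII range, so it cannot match anything CHAR or DIGIT matches.
fragment UNICODE : '\u0080'..'\uFFFE' ;

// The three alternatives are now disjoint, so no alternative gets disabled.
TOKEN : (CHAR | DIGIT | UNICODE)+ ;

// Assumed helper rule: keep whitespace out of the token stream so it separates tokens.
WS    : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;
```

ASCII punctuation is not covered by any rule in this sketch, so it would still need its own rule or it will be reported as an unmatched character.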

Note: The option "charVocabulary" is not available in antlr3 as it uses unicode by default. — Th 00 mÄ s, Nov 27 '12 at 10:05

score 5 · Answer 2 · answered Jan 17 '10 at 17:23

Practically speaking, TOKEN: (UNICODE)+ is completely useless.

Since everything is a token character, if you try to use such a rule to match a Java program, say, it will simply match the whole program and return it to you as one big token.

You really do need to break your characters down into different groups if you want to split your input apart into meaningful fragments.

It might help you to take a look at how the "pros" have done it. There is a BNF grammar for Java, and its rule for an identifier shows how they took the trouble to group the characters:

identifier 
  ::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" }
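
As a rough sketch of what such grouping could look like in an ANTLR 3 lexer: the grammar name, the rule names, and the exact character ranges below are assumptions for illustration, loosely following the identifier rule above. Uppercase letters are added, and "over 00C0" is read here as starting at \u00C0. The action in WS is Java-target syntax.

```antlr
lexer grammar GroupedTokens;   // hypothetical name; ANTLR 3 expects the file name to match

// Identifiers: one start character followed by any number of part characters.
IDENTIFIER : ID_START ID_PART* ;

// Runs of digits become their own token; they cannot begin an identifier.
NUMBER     : DIGIT+ ;

// Whitespace separates tokens and is moved off the default channel.
WS         : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;

fragment ID_START : 'a'..'z' | 'A'..'Z' | '$' | '_' ;
fragment ID_PART  : ID_START | DIGIT | '\u00C0'..'\uFFFE' ;
fragment DIGIT    : '0'..'9' ;
```

With rules like these, input such as "count1 42" would come back as an IDENTIFIER followed by a NUMBER rather than as one big token. Operators and punctuation would need rules of their own.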
