How do I specify a unicode literal that requires more than four hex digits in Antlr?

Question

I want to define a lexer rule for a range of Unicode characters whose code points need more than four hexadecimal digits. Concretely, I want to declare the following rule:

ID_Continue : [\uE0100-\uE01EF] ;

Unfortunately, it doesn't work. This rule matches characters that are not in this range. (I'm not certain exactly what behaviour it results in, but it isn't the one I want.) I've also tried the following (padding with leading zeros and using 8 digits):

ID_Continue : [\U000E0100-\U000E01EF] ;

But it seems to result in the same unwanted behaviour.

I am using Antlr4 and the IntelliJ plugin for it for testing.

Does Antlr4 not support unicode literals above \uFFFF?

score 3 · Accepted Answer · answered Mar 11 '16 at 11:46

No. ANTLR's maximum is the same as Java's Character.MAX_VALUE, which is \uFFFF.
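To make that limit concrete, here is a minimal Java sketch (the class and method names are made up for illustration). It prints Character.MAX_VALUE and shows that the two endpoints from the question do not fit in a single Java char: each one needs two UTF-16 code units, a surrogate pair.

```java
class CodePointDemo {
    // Returns the UTF-16 code units Java uses to represent one code point.
    static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // The largest value a single Java char can hold.
        System.out.printf("Character.MAX_VALUE = U+%04X%n", (int) Character.MAX_VALUE);

        // The two endpoints of the range from the question.
        for (int cp : new int[] {0xE0100, 0xE01EF}) {
            char[] units = utf16Units(cp);
            System.out.printf("U+%X needs %d code units:", cp, units.length);
            for (char unit : units) {
                System.out.printf(" %04X", (int) unit);
            }
            System.out.println();
        }
    }
}
```

Running it reports U+FFFF as the maximum, and DB40 DD00 and DB40 DDEF as the code units for the two endpoints.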

If you look at (a part of) ANTLR4's lexer grammar you will see these rules:

// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
    :   Esc
        ( [btnfr"'\\]   // The standard escaped character set such as tab, newline, etc.
        | UnicodeEsc    // A Unicode escape sequence
        | .             // Invalid escape character
        | EOF           // Incomplete at EOF
        )
    ;

...

fragment UnicodeEsc
    :   'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
    ;

...

fragment Esc : '\\' ;
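The UnicodeEsc fragment above accepts at most four hex digits after the 'u'. Assuming the tool interprets the escape the same way the fragment tokenizes it (a guess at the mechanism, not confirmed against ANTLR's source), the following Java sketch mimics that rule on the character set body from the question. The class and method names are hypothetical.

```java
class UnicodeEscSplit {
    // Mimics the UnicodeEsc fragment: a backslash, 'u', then at most four
    // hex digits. Returns how many characters the escape starting at
    // `start` consumes. Assumes s.charAt(start) is the backslash.
    static int escapeLength(String s, int start) {
        int i = start + 2; // skip the backslash and the 'u'
        int digits = 0;
        while (i < s.length() && digits < 4
                && Character.digit(s.charAt(i), 16) >= 0) {
            i++;
            digits++;
        }
        return i - start;
    }

    public static void main(String[] args) {
        // Body of the character set from the question, without the brackets.
        String body = "\\uE0100-\\uE01EF";
        int n = escapeLength(body, 0);
        System.out.println("escape:   " + body.substring(0, n));
        System.out.println("leftover: " + body.substring(n));
    }
}
```

Under that assumption the first escape ends after E010, and the trailing 0 is left over as an ordinary character that begins the range. That would explain why the rule matches characters outside the intended range.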

I did look there, but I wanted to be absolutely sure. Thank you! — Mats Rydberg, Mar 11 '16 at 14:08

score 0 · Answer 2 · answered Mar 12 '16 at 10:21

Note: the limitation to the BMP is purely a Java limitation. Other targets might go much further. For instance my MySQL grammar, written for ANTLR3 (C target) can easily lex e.g. emojis from beyond the BMP. This works for quoted strings as well as IDENTIFIERs.

What's a bit strange, however, is that I haven't specified that range in the grammar (it uses only the BMP). Still, the parser can parse any UTF-8 input. It might be a bug in the target runtime, though I'm happy it exists :-D
