9

On page 74 of the ANTLR4 book it says that any Unicode character can be used in a grammar simply by specifying its codepoint in this manner:

'\uxxxx'

where xxxx is the hexadecimal value for the Unicode codepoint.

So I used that technique in a token rule for an ID token:

grammar ID;

id : ID EOF ;

ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;

When I tried to parse this input:

Gŭnter

ANTLR throws an error, saying that it does not recognize ŭ. (The ŭ character is hex 016D, so it is within the specified range.)
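For what it's worth, the range in the grammar really does cover the character; a quick check with plain Java (no ANTLR involved) confirms it, which suggests the problem lies elsewhere:

```java
public class RangeCheck {
    public static void main(String[] args) {
        char c = '\u016D'; // ŭ
        // The codepoint in hex, and whether it falls inside '\u0100' .. '\u017E'
        System.out.println(Integer.toHexString(c));          // prints "16d"
        System.out.println(c >= '\u0100' && c <= '\u017E');  // prints "true"
    }
}
```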

What am I doing wrong please?

Roger Costello

4 Answers

10

ANTLR is ready to accept 16-bit characters but, by default, many locales will read in characters as bytes (8 bits). You need to specify the appropriate encoding when you read from the file using the Java libraries. If you are using the TestRig, perhaps through alias/script grun, then use argument -encoding utf-8 or whatever. If you look at the source code of that class, you will see the following mechanism:

InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
Terence Parr
  • Thanks! I tried adding the -encoding flag when invoking TestRig: java org.antlr.v4.runtime.misc.TestRig -encoding UTF-8 ID.g4 However, that resulted in this error: Can't load -encoding as lexer or parser. Suggestions? – Roger Costello Jan 24 '15 at 21:32
  • 2
    you don't use TestRig on your grammar. that is what antlr is for. – Terence Parr Jan 25 '15 at 01:29
  • Hi! Still no success. Here's what I did: (1) java org.antlr.v4.Tool -encoding UTF-8 ID.g4 (2) javac *.java (3) java org.antlr.v4.runtime.misc.TestRig ID id -gui < input.txt ... That resulted in a series of "token recognition" errors. Suggestions? – Roger Costello Jan 25 '15 at 10:22
  • 3
    -encoding goes on grun per my answer – Terence Parr Jan 25 '15 at 18:19
  • Ah, yes that was a typo. Here's the command that I actually ran: java org.antlr.v4.runtime.misc.TestRig -encoding UTF-8 ID id -gui < input.txt Again, as before, that resulted in this error: Can't load -encoding as lexer or parser. Suggestions? – Roger Costello Jan 26 '15 at 10:20
  • 3
    `java -Dfile.encoding=UTF-8 org.antlr.v4.runtime.misc.TestRig ID ...` – Gunther Jan 26 '15 at 14:38
  • 2
    Thanks Gunther. I gave that a go, but it gives the same error message: Can't load -encoding as lexer or parser. – Roger Costello Jan 26 '15 at 17:02
  • Even though this is a year late, it might help someone as it helped me. It says "Can't load -encoding as lexer or parser." because the name of your parser/lexer has to go right after org.antlr.v4.runtime.misc.TestRig. It is now trying to find the lexer/parser named '-encoding'. The correct command is "java org.antlr.v4.runtime.misc.TestRig grammarName -encoding UTF-8 startRule input.txt" – Emiel Steerneman Apr 11 '16 at 23:12
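Putting the comment thread together, a working sequence looks roughly like this (a sketch assuming the ANTLR 4 jar is on the classpath and the grammar from the question; TestRig reads the grammar name and start rule before any flags):

```shell
java org.antlr.v4.Tool -encoding UTF-8 ID.g4    # generate lexer/parser sources
javac ID*.java                                  # compile them
# grammar name and start rule come first, flags such as -encoding after:
java org.antlr.v4.runtime.misc.TestRig ID id -encoding UTF-8 -tree input.txt
```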
3

For those having the same problem using ANTLR4 in Java code: since ANTLRInputStream is deprecated, here is a working way to pass multi-character Unicode data from a String to the MyLexer lexer:

String myString = "\u2013";

CharBuffer charBuffer = CharBuffer.wrap(myString.toCharArray());
CodePointBuffer codePointBuffer = CodePointBuffer.withChars(charBuffer);
CodePointCharStream cpcs = CodePointCharStream.fromBuffer(codePointBuffer);

MyLexer lexer = new MyLexer(cpcs);
CommonTokenStream tokens = new CommonTokenStream(lexer);
Alice Oualouest
3

Grammar:

NAME:
   [A-Za-z][0-9A-Za-z\u0080-\uFFFF_]+
;

Java:

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStream;

import com.thalesgroup.dms.stimulus.StimulusParser.SystemContext;

final class RequirementParser {

   static SystemContext parse( String requirement ) {
      requirement = requirement.replaceAll( "\t", "   " );
      final CharStream     charStream = CharStreams.fromString( requirement );
      final StimulusLexer  lexer      = new StimulusLexer( charStream );
      final TokenStream    tokens     = new CommonTokenStream( lexer );
      final StimulusParser parser     = new StimulusParser( tokens );
      final SystemContext  system     = parser.system();
      if( parser.getNumberOfSyntaxErrors() > 0 ) {
         Debug.format( requirement );
      }
      return system;
   }

   private RequirementParser() {/**/}
}

Source:

Lexers and Unicode text

Aubin
0

You can specify the encoding of the file when actually reading it. For Kotlin/Java that could look like this; there is no need to specify the encoding in the grammar!

val inputStream: CharStream = CharStreams.fromFileName(fileName, Charset.forName("UTF-16LE"))
val lexer = BlastFeatureGrammarLexer(inputStream)

Supported Charsets by Java/Kotlin