
I have been struggling to get ANTLR 4 to recognise Unicode characters in the input.

I reduced my grammar to a simpler test I found in an answer to a related question; all I've done is change the characters it's supposed to recognise, but it still didn't work.

Grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '↊' | '↋';

Test class:

class UnicodeTest {
    @Test
    fun `parse unicode`() {
        val lexer = UnicodeLexer(CharStreams.fromString("↊↋"))
        val parser = UnicodeParser(CommonTokenStream(lexer))
        val result = parser.stat().text
        println("Result = <$result>")
        assertThat(result).isEqualTo("↊↋<EOF>")
    }
}

What I get when I run this is:

> Task :test FAILED
line 1:0 token recognition error at: '↊'
line 1:1 token recognition error at: '↋'
Result = <<EOF>>

expected:<"[↊↋]<EOF>"> but was:<"[]<EOF>">
Expected :"[↊↋]<EOF>"
Actual   :"[]<EOF>"

From stderr, it looks like the lexer is correctly pulling the characters from my string as Unicode (it started as a String, so it had better!), but then not recognising them as a valid token.

I'm not sure how to debug this sort of thing, because the lexer rules get compiled into a giant blob that I can't figure out how to read. What I can verify is that the tokens inside the lexer contain only one element, the EOF.
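One failure mode worth keeping in mind while debugging: if any step in the toolchain decodes the grammar file's UTF-8 bytes with a legacy charset, the literals silently turn into mojibake and stop matching, with no other visible error. A self-contained sketch of the effect (windows-1252 is just an illustrative wrong charset, not necessarily the one in play here):

```kotlin
import java.nio.charset.StandardCharsets

fun main() {
    // The two characters the E_CODE rule is supposed to match.
    val literals = "↊↋"

    // Encode as UTF-8 (how the grammar file is actually saved)...
    val bytes = literals.toByteArray(StandardCharsets.UTF_8)

    // ...then decode with a legacy single-byte charset, as a tool with a
    // non-UTF-8 default would. Each character becomes three unrelated ones.
    val misread = String(bytes, charset("windows-1252"))

    println(misread)  // prints "â†Šâ†‹", which no lexer rule matches
}
```

Nothing errors out here; the rule simply ends up matching characters that never occur in the input.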

Ruled out so far:

  • The grammar file itself is UTF-8.
  • The Java compiler encoding is definitely set to UTF-8.
    tasks.withType<JavaCompile> {
        // Why is this not yet the default? :(
        options.encoding = "UTF-8"
    }
    
  • The Kotlin compiler encoding is supposedly always UTF-8 with no option to change that. Mentioned only because I have no idea which compiler is used to compile the Java classes.
  • When I run tests, those also run as UTF-8.
    tasks.withType<Test> {
        useJUnitPlatform()
        defaultCharacterEncoding = "UTF-8"
    }
    
  • I get the same issue when running the code in my main program, where I can see on the command-line that -Dfile.encoding=UTF-8 is on the command-line.
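If a tool in the chain falls back to the platform default encoding, it can also help to print what the JVM itself thinks that default is. A generic sanity check, not specific to ANTLR:

```kotlin
import java.nio.charset.Charset

fun main() {
    // What -Dfile.encoding resolved to, and the charset the JVM decodes
    // with by default. If these are not UTF-8, any tool relying on the
    // default will mis-read a UTF-8 grammar file.
    println(System.getProperty("file.encoding"))
    println(Charset.defaultCharset().name())
}
```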

Workaround?

If I change the grammar file to use Unicode escapes explicitly, then it works! So OK, there's something about how ANTLR is reading the file, where it isn't defaulting to UTF-8 as many people say it does. I plan to use a lot of Unicode, though, and would prefer not to have to escape everything. So I guess I just have to find the appropriate Gradle config to force the encoding when the ANTLR tool runs. :/
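For reference, the escaped form of the rule that works looks like this (↊ and ↋ are U+218A and U+218B; this is a sketch of the workaround, not the full grammar):

```
E_CODE: '\u218A' | '\u218B';
```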
  • Did you use the `-encoding` option on the java -jar invocation of the Antlr tool? – kaby76 Feb 08 '22 at 23:20
  • 1
    @kaby76 It's being run by Gradle, so that's a very good question. – Hakanai Feb 08 '22 at 23:53
  • Maybe use something like [arguments = arguments + listOf("-encoding", "UTF-8")](https://docs.gradle.org/current/userguide/antlr_plugin.html#sec:controlling_the_antlr_generator_process) (or "utf8", "utf-8", ...)? The other option is to check and set the locale for your machine. The default for the tool is to [use the default locale](https://github.com/antlr/antlr4/blob/fcab02cfd0dedd3b091c8758173b14cbbf4178cf/tool/src/org/antlr/v4/Tool.java#L101). – kaby76 Feb 09 '22 at 00:06
  • 1
    @kaby76 that's definitely the fix, if you want to submit that as an answer. :) It's confounding because I was pretty sure even the ANTLR book said it defaulted to UTF-8. – Hakanai Feb 09 '22 at 22:35
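The fix from the comments, written out in Gradle Kotlin DSL form (`generateGrammarSource` is the task the Gradle ANTLR plugin registers; adjust if your build renames it):

```kotlin
tasks.generateGrammarSource {
    // Without this, the ANTLR tool reads the grammar using the
    // platform default locale/charset rather than UTF-8.
    arguments = arguments + listOf("-encoding", "UTF-8")
}
```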

1 Answer


How the source files are compiled is (AFAIK) not important.

Using your example grammar as-is, I ran the following tests:

InputStream inputStream = new ByteArrayInputStream("↊↋".getBytes(StandardCharsets.UTF_8));
UnicodeLexer lexer = new UnicodeLexer(CharStreams.fromStream(inputStream, StandardCharsets.UTF_8));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and:

UnicodeLexer lexer = new UnicodeLexer(CharStreams.fromFileName("input.txt", StandardCharsets.UTF_8));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

(where the file input.txt contains ↊↋)

and both resulted in the following being printed to my console:

↊↋<EOF>

In other words, did you try specifying the encoding when creating the CharStream?

Bart Kiers
  • CharStream takes a string so I don't get the opportunity to pass an encoding anyway. And I can see from the output that it's reading the string correctly. In any case, even if I convert to bytes and then pass that in with the encoding, I get the same result as passing in the string. So I'm pretty sure it's how the grammar is being compiled. Most likely, all the info around the web about ANTLR defaulting to UTF-8 is a lie. – Hakanai Feb 09 '22 at 22:34
  • I bet you're running this on a machine where the default platform encoding just happens to be UTF-8 and that's the only reason it worked. :) – Hakanai Feb 09 '22 at 22:39
Then things work differently than I thought/expected. That is too bad :(. I'll remove this answer shortly. – Bart Kiers Feb 10 '22 at 08:16