I have been struggling to get ANTLR 4 to recognise Unicode characters in the input.
I reduced my grammar to a simpler test I found in this answer to a related question. All I changed is the characters it's supposed to recognise, and it still doesn't work.
Grammar:
grammar Unicode;
stat: E_CODE* EOF;
E_CODE: '↊' | '↋';
Test class:
class UnicodeTest {
    @Test
    fun `parse unicode`() {
        val lexer = UnicodeLexer(CharStreams.fromString("↊↋"))
        val parser = UnicodeParser(CommonTokenStream(lexer))
        val result = parser.stat().text
        println("Result = <$result>")
        assertThat(result).isEqualTo("↊↋<EOF>")
    }
}
What I get when I run this is:
> Task :test FAILED
line 1:0 token recognition error at: '↊'
line 1:1 token recognition error at: '↋'
Result = <<EOF>>
expected:<"[↊↋]<EOF>"> but was:<"[]<EOF>">
Expected :"[↊↋]<EOF>"
Actual :"[]<EOF>"
From stderr, it looks like it is correctly pulling the characters from my string as Unicode (it did start as a String
so it had better!), but then not recognising the characters as a valid token.
I'm not sure how to debug this sort of thing, because the lexer rules get compiled into a giant blob that I can't figure out how to read. What I can verify is that tokens
inside the lexer only contains one element, the EOF.
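One way to at least rule out the input side is to dump the code points of the test string. The following is a minimal sketch using only the Kotlin/JVM standard library; `toUnicodeEscapes` is a hypothetical helper name, not part of ANTLR. It prints each code point as a `\uXXXX` escape (and, on the assumption that the ANTLR version in use accepts the `\u{...}` form, uses that for code points above U+FFFF).

```kotlin
// Hypothetical helper: render every code point of a string as a Unicode escape.
fun toUnicodeEscapes(text: String): String =
    text.codePoints().toArray().joinToString("") { cp ->
        // BMP code points fit in four hex digits; anything larger uses the braced form.
        if (cp <= 0xFFFF) "\\u%04X".format(cp) else "\\u{%X}".format(cp)
    }

fun main() {
    println(toUnicodeEscapes("↊↋")) // prints \u218A\u218B
}
```

If the output shows U+218A and U+218B, the string itself is fine and the problem is more likely in how the lexer rule was generated than in how the input is read.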
Ruled out so far:
- The grammar file itself is UTF-8.
- The Java compiler encoding is definitely set to UTF-8.
tasks.withType<JavaCompile> {
    // Why is this not yet the default? :(
    options.encoding = "UTF-8"
}
- The Kotlin compiler encoding is supposedly always UTF-8 with no option to change that. Mentioned only because I have no idea which compiler is used to compile the Java classes.
- When I run tests, those also run as UTF-8.
tasks.withType<Test> {
    useJUnitPlatform()
    defaultCharacterEncoding = "UTF-8"
}
- I get the same issue when running the code in my main program, where I can see that -Dfile.encoding=UTF-8 is on the command line.
Workaround?
- If I change the grammar file to use Unicode escapes explicitly, then it works! So there's something about how ANTLR reads the grammar file: it isn't defaulting to UTF-8, as many people say it does. I plan to use a lot of Unicode though, and would prefer not to have to escape everything. So I guess I just have to find some appropriate Gradle config to force the encoding when the ANTLR tool runs. :/
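One candidate for that config, as an untested sketch: the ANTLR tool has an `-encoding` option for the grammar file encoding, and Gradle's `antlr` plugin passes the `arguments` list of its `generateGrammarSource` task through to the tool. This assumes the build applies the standard `antlr` plugin in `build.gradle.kts`; a different plugin would need a different hook.

```kotlin
// build.gradle.kts (sketch; assumes the standard `antlr` plugin is applied)
tasks.generateGrammarSource {
    // Ask the ANTLR tool to read .g4 files as UTF-8 rather than the platform default.
    arguments = arguments + listOf("-encoding", "UTF-8")
}
```

If this is the cause, the generated lexer should change after a clean rebuild, and the literal '↊' | '↋' rule should start matching without escapes.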