ANTLR4 doesn't correctly recognize Unicode characters

Question

I have a very simple grammar that tries to match 'é' to the token E_CODE. I tested it with the TestRig tool (using the -tokens option), but the input isn't matched correctly. My input file is encoded in UTF-8 without a BOM, and I'm using ANTLR version 4.4. Could somebody else check this as well? I got this output on my console:
line 1:0 token recognition error at: 'Ă'

grammar Unicode;

stat:EOF;  
E_CODE: '\u00E9' | 'é';

Bart Kiers · Accepted Answer · 2014-10-27T13:22:27.810


I tested the grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '\u00E9' | 'é';

as follows:

UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and the following got printed to my console:

éé<EOF>

Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).

EDIT

Looking at the source I see TestRig takes an optional -encoding param. Have you tried setting it?
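To illustrate why an encoding setting matters here, the following is a minimal sketch using only the Java standard library, not ANTLR itself. It shows that the UTF-8 bytes of 'é' turn into two unrelated characters when decoded with a single-byte charset. The class and method names are hypothetical. ISO-8859-1 is used because every JVM ships it; the 'Ă' in the question's error message suggests that a different single-byte code page was the platform default there, which is an assumption on my part.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ANTLR.
public final class MojibakeDemo {

    // Encode text as UTF-8, then decode the bytes with another charset,
    // as a reader configured with the wrong encoding would do.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), charset);
    }

    // Render each UTF-16 code unit as U+XXXX so the result does not
    // depend on the console's own encoding.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String e = "\u00E9"; // the character from the E_CODE rule
        // Wrong charset: one character becomes two.
        System.out.println(describe(decodeUtf8BytesAs(e, StandardCharsets.ISO_8859_1)));
        // Matching charset: the character survives the round trip.
        System.out.println(describe(decodeUtf8BytesAs(e, StandardCharsets.UTF_8)));
    }
}
```

A lexer that only knows U+00E9 cannot match the first of the two misdecoded characters, which is consistent with a token recognition error at line 1:0.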

edited Oct 27 '14 at 13:22

answered Oct 24 '14 at 14:16

Bart Kiers


Yes, I got the same result, but TestRig still can't recognize this input. – Adrian Oct 27 '14 at 07:28
Hard to say. Perhaps a terminal thing? Looking at [the source](https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/misc/TestRig.java) I see `TestRig` takes an optional `-encoding` param. Have you tried setting it? – Bart Kiers Oct 27 '14 at 09:07
Now it works. This additional parameter solved the issue: set `-encoding UTF-8`. – Adrian Oct 27 '14 at 13:13
Cool, I'll add it to my answer. – Bart Kiers Oct 27 '14 at 13:21
I've been struggling with a similar issue. So I tested the code in this answer in v4.9.3, and it worked. But, if I change the grammar to accept `| '↊' | '↋'` instead, and then feed in `"↊↋"` as input, it does not work. So it seems like there's something special about `"é"`? – Hakanai Feb 08 '22 at 22:07
@Hakanai I suggest you create a question of your own on SO. These comment sections are not well suited for Q&A's. – Bart Kiers Feb 08 '22 at 22:10
Yeah, I probably will, but my experience is that it ends up being closed as a duplicate. :( – Hakanai Feb 08 '22 at 22:18
Actually I think this answer might just be misleading. If you have this: `E_CODE: 'é';` it doesn't work. Putting in the explicit escape appears to be what's making it work. I put a question up anyway because it really seems like putting Unicode directly into the grammar _should_ be possible... – Hakanai Feb 08 '22 at 23:03

David Tonhofer · Answer 2 · 2022-10-09T10:19:05.180

This is not an answer but a large comment.

I just hit a snag with Unicode, so I thought I would test this. It turned out that I had encoded the input file incorrectly, but here is the test code anyway. Everything uses the defaults and works extremely well in ANTLR 4.10.1. Maybe it is of some use:

grammar LetterNumbers;

text: WORD*;

WS: [ \t\r\n]+ -> skip ; // toss out whitespace

// The characters for which Character.getType(ch) returns Character.LETTER_NUMBER
// The list: https://www.compart.com/en/unicode/category/Nl
// Roman Numerals are the best known here

WORD: LETTER_NUMBER+;

LETTER_NUMBER:
[\u16ee-\u16f0]|[\u2160-\u2182]|[\u2185-\u2188]
|'\u3007'
|[\u3021-\u3029]|[\u3038-\u303a]|[\ua6e6-\ua6ef];

And the JUnit5 test that goes with that:

package antlerization.minitest;

import antlrgen.minitest.LetterNumbersBaseListener;
import antlrgen.minitest.LetterNumbersLexer;
import antlrgen.minitest.LetterNumbersParser;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.TerminalNode;
import org.junit.jupiter.api.Test;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import java.util.LinkedList;
import java.util.List;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.*;

public class MiniTest {

    static class WordCollector extends LetterNumbersBaseListener {

        public final List<String> collected = new LinkedList<>();

        @Override
        public void exitText(LetterNumbersParser.TextContext ctx) {
            for (TerminalNode tn : ctx.getTokens(LetterNumbersLexer.WORD)) {
                collected.add(tn.getText());
            }
        }

    }

    private static ParseTree stringToParseTree(String inString) {
        Lexer lexer = new LetterNumbersLexer(CharStreams.fromString(inString));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        // "text" is the root of the grammar tree
        // this returns a subclass of ParseTree: LetterNumbersParser.TextContext
        return (new LetterNumbersParser(tokens)).text();
    }

    private static List<String> collectWords(ParseTree parseTree) {
        WordCollector wc = new WordCollector();
        (new ParseTreeWalker()).walk(wc, parseTree);
        return wc.collected;
    }

    private static String joinForTest(List<String> list) {
        return String.join(",",list);
    }

    private static String stringInToStringOut(String parseThis) {
        return joinForTest(collectWords(stringToParseTree(parseThis)));
    }

    @Test
    void unicodeCharsOneWord() {
        String res = stringInToStringOut("ⅣⅢⅤⅢ");
        assertThat(res,equalTo("ⅣⅢⅤⅢ"));
    }

    @Test
    void escapesOneWord() {
        String res = stringInToStringOut("\u2163\u2162\u2164\u2162");
        assertThat(res,equalTo("ⅣⅢⅤⅢ"));
    }

    @Test
    void unicodeCharsMultipleWords() {
        String res = stringInToStringOut("ⅠⅡⅢ ⅣⅤⅥ ⅦⅧⅨ ⅩⅪⅫ ⅬⅭⅮⅯ");
        assertThat(res,equalTo("ⅠⅡⅢ,ⅣⅤⅥ,ⅦⅧⅨ,ⅩⅪⅫ,ⅬⅭⅮⅯ"));
    }

    @Test
    void unicodeCharsLetters() {
        String res = stringInToStringOut("Ⅰ Ⅱ Ⅲ \n Ⅳ Ⅴ Ⅵ \n Ⅶ Ⅷ Ⅸ \n Ⅹ Ⅺ Ⅻ \n Ⅼ Ⅽ Ⅾ Ⅿ");
        assertThat(res,equalTo("Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ,Ⅷ,Ⅸ,Ⅹ,Ⅺ,Ⅻ,Ⅼ,Ⅽ,Ⅾ,Ⅿ"));
    }

}
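The grammar comment above ties the LETTER_NUMBER rule to Character.getType. As a rough cross-check that needs no ANTLR dependency, the sketch below compares the ranges copied from that rule against what the running JDK reports for every BMP code point. The class and method names are hypothetical, and whether any mismatch is printed depends on the Unicode version of the JDK in use.

```java
// Hypothetical helper, not part of the original test suite.
public final class LetterNumberCheck {

    // Inclusive BMP ranges copied from the LETTER_NUMBER lexer rule.
    static final int[][] RANGES = {
        {0x16EE, 0x16F0}, {0x2160, 0x2182}, {0x2185, 0x2188},
        {0x3007, 0x3007}, {0x3021, 0x3029}, {0x3038, 0x303A},
        {0xA6E6, 0xA6EF}
    };

    // True if the code point falls inside one of the grammar's ranges.
    static boolean inGrammarRanges(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // True if the JDK classifies the code point as general category Nl.
    static boolean isLetterNumber(int codePoint) {
        return Character.getType(codePoint) == Character.LETTER_NUMBER;
    }

    public static void main(String[] args) {
        // Report every BMP code point where grammar and JDK disagree.
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            if (isLetterNumber(cp) != inGrammarRanges(cp)) {
                System.out.printf("mismatch at U+%04X%n", cp);
            }
        }
    }
}
```

The check stops at U+FFFF because the grammar rule only lists BMP ranges; letter numbers in the supplementary planes are outside its scope.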

score -1 · Answer 3 · answered Oct 23 '18 at 14:57


Your grammar file is not saved in UTF-8. According to Terence Parr's book, UTF-8 is the default encoding that ANTLR expects for an input grammar file.
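If the encoding of the grammar file cannot be controlled, one workaround in the spirit of the question's '\u00E9' alternative is to keep the grammar ASCII-only. The sketch below is a hypothetical standard-library helper, not an ANTLR API, that rewrites non-ASCII characters as four-digit Unicode escapes. It works on UTF-16 code units, so it is only meant for characters in the BMP.

```java
// Hypothetical helper, not part of ANTLR.
public final class GrammarEscaper {

    // Replace every non-ASCII UTF-16 code unit with a four-digit
    // Unicode escape; ASCII characters are copied unchanged.
    // Assumption: the input contains BMP characters only.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The Java escape below keeps this source file ASCII-only too.
        String rule = "E_CODE: '\u00E9';";
        System.out.println(escapeNonAscii(rule));
    }
}
```

A grammar written this way reads the same under any ASCII-compatible encoding, at the cost of being harder for humans to read.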


Bruno Parrotta

