4

I have a very simple grammar that tries to match 'é' to the token E_CODE. I tested it with the TestRig tool (using the -tokens option), but the input isn't matched correctly. My input file is encoded in UTF-8 without a BOM, and I'm using ANTLR 4.4. Could somebody else check this as well? I got this output on my console:
line 1:0 token recognition error at: 'Ă'

grammar Unicode;

stat:EOF;  
E_CODE: '\u00E9' | 'é';
Adrian

3 Answers

1

I tested the grammar:

grammar Unicode;

stat: E_CODE* EOF;

E_CODE: '\u00E9' | 'é';

as follows:

UnicodeLexer lexer = new UnicodeLexer(new ANTLRInputStream("\u00E9é"));
UnicodeParser parser = new UnicodeParser(new CommonTokenStream(lexer));
System.out.println(parser.stat().getText());

and the following got printed to my console:

éé<EOF>

Tested with 4.2 and 4.3 (4.4 isn't in Maven Central yet).

EDIT

Looking at the source I see TestRig takes an optional -encoding param. Have you tried setting it?
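To illustrate why the encoding setting matters, here is a minimal sketch using only the JDK (no ANTLR involved). It encodes 'é' as UTF-8 and then decodes the two resulting bytes with a single-byte charset, which is roughly what happens when a UTF-8 file is read with a non-UTF-8 platform default. The class name `EncodingDemo` and the helper `roundTrip` are made up for this illustration, and ISO-8859-1 is just a stand-in for whatever default the asker's machine used; the 'Ă' in the error message would be consistent with a Central European code page, but that is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Hypothetical helper: encode the text as UTF-8, then decode those
    // bytes with the given charset (the "wrong reader" scenario).
    static String roundTrip(String text, Charset decodeWith) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, decodeWith);
    }

    public static void main(String[] args) {
        String e = "\u00E9"; // é, written as an escape so the source encoding doesn't matter

        // Decoding with the matching charset gives back the single character.
        String ok = roundTrip(e, StandardCharsets.UTF_8);
        System.out.println(ok.length()); // 1

        // Decoding with a single-byte charset turns the two UTF-8 bytes
        // (0xC3 0xA9) into two separate characters.
        String broken = roundTrip(e, StandardCharsets.ISO_8859_1);
        System.out.println(broken.length()); // 2
        for (char c : broken.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // U+00C3, then U+00A9
        }
    }
}
```

A lexer that only knows U+00E9 has no rule for the first of those two characters, which matches the shape of the "token recognition error" in the question.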

Bart Kiers
  • Yes, I got the same result, but TestRig still can't recognize this input. – Adrian Oct 27 '14 at 07:28
  • Hard to say. Perhaps a terminal thing? Looking at [the source](https://github.com/antlr/antlr4/blob/master/runtime/Java/src/org/antlr/v4/runtime/misc/TestRig.java) I see `TestRig` takes an optional `-encoding` param. Have you tried setting it? – Bart Kiers Oct 27 '14 at 09:07
  • Now it works. This additional parameter solved the issue: set `-encoding UTF-8`. – Adrian Oct 27 '14 at 13:13
  • Cool, I'll add it to my answer. – Bart Kiers Oct 27 '14 at 13:21
  • I've been struggling with a similar issue. So I tested the code in this answer in v4.9.3, and it worked. But, if I change the grammar to accept `| '↊' | '↋'` instead, and then feed in `"↊↋"` as input, it does not work. So it seems like there's something special about `"é"`? – Hakanai Feb 08 '22 at 22:07
  • @Hakanai I suggest you create a question of your own on SO. These comment sections are not well suited for Q&A's. – Bart Kiers Feb 08 '22 at 22:10
  • Yeah, I probably will, but my experience is that it ends up being closed as a duplicate. :( – Hakanai Feb 08 '22 at 22:18
  • Actually I think this answer might just be misleading. If you have this: `E_CODE: 'é';` it doesn't work. Putting in the explicit escape appears to be what's making it work. I put a question up anyway because it really seems like putting Unicode directly into the grammar _should_ be possible... – Hakanai Feb 08 '22 at 23:03
0

This is not an answer but a large comment.

I just hit a snag with Unicode, so I thought I would test this. It turned out that I had encoded the input file incorrectly, but here is the test code anyway. Everything uses the defaults and works well in ANTLR 4.10.1. Maybe it is of some use:

grammar LetterNumbers;

text: WORD*;

WS: [ \t\r\n]+ -> skip ; // toss out whitespace

// The letters that return Character.LETTER_NUMBER to Character.getType(ch)
// The list: https://www.compart.com/en/unicode/category/Nl
// Roman Numerals are the best known here

WORD: LETTER_NUMBER+;

LETTER_NUMBER:
[\u16ee-\u16f0]|[\u2160-\u2182]|[\u2185-\u2188]
|'\u3007'
|[\u3021-\u3029]|[\u3038-\u303a]|[\ua6e6-\ua6ef];

And the JUnit5 test that goes with that:

package antlerization.minitest;

import antlrgen.minitest.LetterNumbersBaseListener;
import antlrgen.minitest.LetterNumbersLexer;
import antlrgen.minitest.LetterNumbersParser;
import org.antlr.v4.runtime.Lexer;
import org.antlr.v4.runtime.tree.TerminalNode;
import org.junit.jupiter.api.Test;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import java.util.LinkedList;
import java.util.List;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.*;

public class MiniTest {

    static class WordCollector extends LetterNumbersBaseListener {

        public final List<String> collected = new LinkedList<>();

        @Override
        public void exitText(LetterNumbersParser.TextContext ctx) {
            for (TerminalNode tn : ctx.getTokens(LetterNumbersLexer.WORD)) {
                collected.add(tn.getText());
            }
        }

    }

    private static ParseTree stringToParseTree(String inString) {
        Lexer lexer = new LetterNumbersLexer(CharStreams.fromString(inString));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        // "text" is the root of the grammar tree
        // this returns a subclass of ParseTree: LetterNumbersParser.TextContext
        return (new LetterNumbersParser(tokens)).text();
    }

    private static List<String> collectWords(ParseTree parseTree) {
        WordCollector wc = new WordCollector();
        (new ParseTreeWalker()).walk(wc, parseTree);
        return wc.collected;
    }

    private static String joinForTest(List<String> list) {
        return String.join(",",list);
    }

    private static String stringInToStringOut(String parseThis) {
        return joinForTest(collectWords(stringToParseTree(parseThis)));
    }

    @Test
    void unicodeCharsOneWord() {
        String res = stringInToStringOut("ⅣⅢⅤⅢ");
        assertThat(res,equalTo("ⅣⅢⅤⅢ"));
    }

    @Test
    void escapesOneWord() {
        String res = stringInToStringOut("\u2163\u2162\u2164\u2162");
        assertThat(res,equalTo("ⅣⅢⅤⅢ"));
    }

    @Test
    void unicodeCharsMultipleWords() {
        String res = stringInToStringOut("ⅠⅡⅢ ⅣⅤⅥ ⅦⅧⅨ ⅩⅪⅫ ⅬⅭⅮⅯ");
        assertThat(res,equalTo("ⅠⅡⅢ,ⅣⅤⅥ,ⅦⅧⅨ,ⅩⅪⅫ,ⅬⅭⅮⅯ"));
    }

    @Test
    void unicodeCharsLetters() {
        String res = stringInToStringOut("Ⅰ Ⅱ Ⅲ \n Ⅳ Ⅴ Ⅵ \n Ⅶ Ⅷ Ⅸ \n Ⅹ Ⅺ Ⅻ \n Ⅼ Ⅽ Ⅾ Ⅿ");
        assertThat(res,equalTo("Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ,Ⅷ,Ⅸ,Ⅹ,Ⅺ,Ⅻ,Ⅼ,Ⅽ,Ⅾ,Ⅿ"));
    }

}
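The grammar comment above says the listed ranges are the characters for which `Character.getType(ch)` returns `Character.LETTER_NUMBER`. As a rough cross-check that needs no ANTLR at all, a sketch like the following asks the JDK directly. The class name `LetterNumberCheck` and both helper methods are invented for this example, and the result depends on the Unicode tables shipped with the JDK in use.

```java
public class LetterNumberCheck {

    // True if the JDK classifies the code point as Unicode category Nl.
    static boolean isLetterNumber(int codePoint) {
        return Character.getType(codePoint) == Character.LETTER_NUMBER;
    }

    // True if the string is non-empty and consists only of Nl code points,
    // i.e. it should lex as a single WORD under the grammar above.
    static boolean allLetterNumbers(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(LetterNumberCheck::isLetterNumber);
    }

    public static void main(String[] args) {
        // Roman numerals four, three, five, three, written as escapes
        System.out.println(allLetterNumbers("\u2163\u2162\u2164\u2162")); // true
        // Plain ASCII letters are category Lu, not Nl
        System.out.println(allLetterNumbers("IV")); // false
        // U+3007 is listed on its own in the grammar
        System.out.println(isLetterNumber(0x3007)); // true
    }
}
```

If a test like `unicodeCharsOneWord` fails while this check passes for the same string, the problem is more likely in how the source or input file was encoded than in the grammar ranges.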
David Tonhofer
-1

Your grammar file is not saved in UTF-8 format. According to Terence Parr's book, UTF-8 is the default encoding that ANTLR accepts for input grammar files.