
I have an ID terminal (lexer rule):

ID      : ([A-Z_]|'\u0100'..'\uFFFE') ([A-Z_0-9]|'\u0100'..'\uFFFE')*;

and a sample .txt file to parse:

均60:=MA(C,60);

I generated the Java and Python2 targets and tested each against the sample file. The Java target parses the file, but the Python2 target does not: it reports a token recognition error at: '均'. I also tested the Python2 target against other valid inputs, and all of them work except those containing Unicode characters. Did I miss something, or does the Python target not support parsing Unicode?
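As a sanity check that does not involve ANTLR at all, the ID rule can be approximated with a regular expression to confirm that the sample identifier really falls inside the declared ranges. This is only a sketch: ID_PATTERN and leading_id are hypothetical names introduced for illustration, and the pattern mirrors the rule for characters in the Basic Multilingual Plane only.

```python
# -*- coding: utf-8 -*-
import re

# Rough regex equivalent of the ID lexer rule (an approximation, not ANTLR itself):
#   first char: A-Z, '_' or U+0100..U+FFFE
#   rest:       A-Z, '_', 0-9 or U+0100..U+FFFE
ID_PATTERN = re.compile(u"[A-Z_\u0100-\uFFFE][A-Z_0-9\u0100-\uFFFE]*")


def leading_id(text):
    """Return the ID-like prefix of text, or None if the rule would not match."""
    match = ID_PATTERN.match(text)
    return match.group(0) if match else None


sample = u"均60:=MA(C,60);"

# Show the code point of the first character and the prefix the rule accepts.
print(hex(ord(sample[0])))
print(leading_id(sample))
```

If the first character is inside U+0100..U+FFFE and the prefix is matched here, the grammar rule itself is not what rejects the input, which is consistent with the Java target accepting the file.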

Java target commands

mkdir -p java
java -jar /usr/local/lib/antlr-4.5.3-complete.jar TDX.g4 -o ./java
cd ./java
javac TDX*.java
java org.antlr.v4.gui.TestRig TDX prog -gui ../samples/1.txt

Python target generation command

java -jar /usr/local/lib/antlr-4.5.3-complete.jar -Dlanguage=Python2 TDX.g4 -o ./tdx_py/antlrgen -visitor

Python code

import sys
from antlr4 import *
from tdx_py.antlrgen import TDXLexer, TDXParser

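def executefile(path):
    # Read the sample as UTF-8; avoid shadowing the built-ins `file` and `input`.
    input_stream = FileStream(path, encoding='utf-8')
    lexer = TDXLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = TDXParser(stream)
    # prog is the start rule, the same one passed to TestRig above.
    return parser.prog()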


if __name__ == '__main__':
    executefile(sys.argv[1])
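To rule out an encoding problem in the sample file itself, the file can also be decoded independently of the ANTLR runtime and its first code points inspected. The helper below is a hypothetical sketch using only the standard library; first_code_points is not part of ANTLR, and the default of four characters is arbitrary.

```python
import io


def first_code_points(path, count=4, encoding="utf-8"):
    """Decode the file and return the code points of its first `count` characters."""
    # io.open behaves the same on Python 2 and Python 3 and returns unicode text.
    with io.open(path, encoding=encoding) as handle:
        text = handle.read()
    return [ord(ch) for ch in text[:count]]
```

Calling first_code_points("../samples/1.txt") should list a first value between 0x0100 and 0xFFFE if the file decodes cleanly. A leading 0xFEFF would instead indicate a byte order mark at the start of the file.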
gzc

1 Answer


This is a bug in ANTLR4. See https://github.com/antlr/antlr4/issues/1925

gzc