How do I get this encoding right with ANTLR?

Question

I'm working on a school project: a static code analyzer. One requirement is to analyse C# code from Java, which has been going well so far with ANTLR.

I wrote some example C# code in Visual Studio to scan with ANTLR, and I analyse every C# file in the solution. But it does not work. I am getting what looks like a memory leak, with this error message:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.antlr.runtime.Lexer.emit(Lexer.java:151)
    at org.antlr.runtime.Lexer.nextToken(Lexer.java:86)
    at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
    at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)

After a while I suspected an encoding issue, because all the files are in UTF-8 and I think ANTLR can't read the encoded stream. So I opened Notepad++ and changed the encoding of every file to ANSI, and then it worked. I don't really understand what ANSI means: is it a single character set, or the name of some kind of organisation?

I want to convert the files from whatever encoding they have (probably UTF-8) to this ANSI encoding so I won't get memory leaks anymore.

This is the code that makes the Lexer and Parser:

InputStream inputStream = new FileInputStream(new File(filePath));
CharStream charStream = new ANTLRInputStream(inputStream);
CSharpLexer cSharpLexer = new CSharpLexer(charStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(cSharpLexer);
CSharpParser cSharpParser = new CSharpParser(commonTokenStream);

Does anyone know how to change the encoding of the InputStream to the right encoding?
And what does Notepad++ do when I change the encoding to ANSI?

I'm not sure if sites like Pastebin keep the right encoding. But here is an example: http://pastebin.com/ji8AHcRN — Thomas Schmidt, May 03 '12 at 12:15

score 1 · Answer 1 · answered May 03 '12 at 14:19

When reading text files you should set the encoding explicitly. Try your example with the following change:

CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
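To illustrate why the charset matters, here is a minimal sketch that uses only the JDK and no ANTLR. The class name `EncodingDemo` and the helper `decode` are made up for this example. It decodes the same bytes with two charsets; ISO-8859-1 stands in for a single-byte "ANSI" code page, which is an assumption about what Notepad++ produced. It also shows that Java's UTF-8 decoder keeps a leading byte order mark as the character U+FEFF instead of dropping it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Hypothetical helper: decode raw file bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "é" encoded as UTF-8 takes two bytes: C3 A9.
        byte[] eAcute = {(byte) 0xC3, (byte) 0xA9};
        // Correct charset: the two bytes decode to one character.
        System.out.println(decode(eAcute, StandardCharsets.UTF_8).length());      // 1
        // Wrong single-byte charset: each byte becomes its own character.
        System.out.println(decode(eAcute, StandardCharsets.ISO_8859_1).length()); // 2

        // UTF-8 BOM (EF BB BF) followed by the letter 'a'.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a'};
        String text = decode(withBom, StandardCharsets.UTF_8);
        // The BOM survives decoding as U+FEFF, so the text has two characters.
        System.out.println(text.length());              // 2
        System.out.println(text.charAt(0) == '\uFEFF'); // true
    }
}
```

If the lexer grammar has no rule that accepts U+FEFF, that leading character could plausibly be what upsets it, even when the encoding itself is specified correctly.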

Andrew T Finnell


I added an answer here for ANTLR4. http://stackoverflow.com/questions/28126507/antlr4-using-non-ascii-characters-in-token-rules/28129510#28129510 – Terence Parr Jan 24 '15 at 19:46

score -1 · Accepted Answer · answered May 09 '12 at 01:26

I solved this issue by wrapping the InputStream in a buffered stream (BufferedInputStream) and then removing the Byte Order Mark.

I guess my parser didn't like that encoding, because I had also tried setting the encoding explicitly.
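A minimal sketch of that approach, assuming the files start with the UTF-8 byte order mark (bytes EF BB BF). The class name `BomSkipper` and method `skipUtf8Bom` are hypothetical names chosen for this example, not the code actually used in the project.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BomSkipper {
    // Returns a stream positioned just after a leading UTF-8 BOM, if there is one.
    // Input without a BOM is returned unchanged from its first byte.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(3); // remember the start; we look at up to 3 bytes
        boolean hasBom = buffered.read() == 0xEF
                && buffered.read() == 0xBB
                && buffered.read() == 0xBF;
        if (!hasBom) {
            buffered.reset(); // not a BOM: rewind so no data is lost
        }
        return buffered;
    }

    public static void main(String[] args) throws IOException {
        // BOM followed by the ASCII text "class A {}".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                'c', 'l', 'a', 's', 's', ' ', 'A', ' ', '{', '}'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(data));
        // The first byte read is now 'c' rather than 0xEF.
        System.out.println((char) clean.read());
    }
}
```

The returned stream could then be handed to the lexer setup from the question, for example `new ANTLRInputStream(BomSkipper.skipUtf8Bom(inputStream), "UTF-8")`.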

Thomas Schmidt

