ANTLR4: TokenStreamRewriter output doesn't have proper format (removes whitespaces)

Question

I am using ANTLR 4 and the Java 7 grammar to modify an input Java source file. More specifically, I am using the TokenStreamRewriter class to modify some tokens. The following sample shows how the tokens are modified:

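public class TestListener extends JavaBaseListener {
    private final TokenStreamRewriter rewriter;
    public TestListener(TokenStream tokenStream) {
        rewriter = new TokenStreamRewriter(tokenStream);
    }
    // Illustrative helper (name is hypothetical): call it from an enter/exit callback.
    void replaceRule(ParserRuleContext ctx) {
        rewriter.replace(ctx.getStart(), ctx.getStop(), "someText");
    }
}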

When I print the altered source code, all spaces, tabs and line breaks are gone, and the new source file looks like this:

importjava.util.ArrayList;publicclassMain{publicstaticvoidmain(String[]args){MyTimertimer=newMyTimer();}}

I am using extractor.getText() for printing it back.

Is this a problem of the grammar used or should I use some other method from the TokenStreamRewriter class?

score 26 · Accepted Answer · edited Feb 19 '14 at 22:57


The issue is that the lexer is not sending white space to the parser, which means that the rewrite stream doesn't have access to the tokens either. It is because of the skip lexer command:

WS : [ \t\r\n\u000C]+ -> skip ;

You have to change all those to -> channel(HIDDEN) which will send them to the parser on a different channel, making them available in the token stream, but invisible to the parser.
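For the rule quoted above, that change would read WS : [ \t\r\n\u000C]+ -> channel(HIDDEN) ; and any other rule in the grammar that ends in -> skip would need the same treatment. To illustrate why this matters, here is a minimal sketch in plain Java. It is a toy model, not the ANTLR runtime: the names ChannelSketch, Tok, lex, render and parserVisibleCount are all made up for this example, and the "lexer" only knows words, single symbols and whitespace. It shows that dropping whitespace tokens loses them for good, while tagging them with a hidden channel keeps them in the buffer that gets printed and still hides them from the parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of lexer channels; this is NOT the ANTLR runtime API. */
final class ChannelSketch {
    static final int DEFAULT = 0; // channel the parser reads
    static final int HIDDEN = 1;  // kept in the token buffer, ignored by the parser

    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    /** Splits input into word and symbol tokens; whitespace is dropped or hidden. */
    static List<Tok> lex(String input, boolean skipWhitespace) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = i;
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
                // "skip" produces no token at all; "hidden" keeps the text around.
                if (!skipWhitespace) tokens.add(new Tok(input.substring(start, i), HIDDEN));
            } else if (Character.isLetterOrDigit(c)) {
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add(new Tok(input.substring(start, i), DEFAULT));
            } else {
                i++; // any other character becomes a one-character token
                tokens.add(new Tok(input.substring(start, i), DEFAULT));
            }
        }
        return tokens;
    }

    /** What can be printed back: the concatenation of every buffered token. */
    static String render(List<Tok> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) sb.append(t.text);
        return sb.toString();
    }

    /** What the parser would see: default-channel tokens only. */
    static int parserVisibleCount(List<Tok> tokens) {
        int n = 0;
        for (Tok t : tokens) if (t.channel == DEFAULT) n++;
        return n;
    }

    public static void main(String[] args) {
        String src = "public class Main { }";
        System.out.println(render(lex(src, true)));  // prints publicclassMain{}
        System.out.println(render(lex(src, false))); // prints public class Main { }
    }
}
```

In both runs the parser-visible token count is the same (5 for this input), so in this model the parse is unaffected; only the text that can be reassembled from the buffer differs.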

edited Feb 19 '14 at 22:57 by Sam Harwell
answered Feb 19 '14 at 19:43 by Terence Parr

Thank you very much for your quick reply. The proposed change in the file (Java.g4) worked well. – Mike B Feb 19 '14 at 23:40

Within a context, the interval boundaries are stored, and there is a way to access the entire inputStream, which retrieves all text, regardless of skip or HIDDEN channel. TokenStreamRewriter is fundamentally broken as it gives neither access to the original stream start/stop indexes, nor overloads of node GetText so we can obtain the actual text. GetText() on TokenStreamRewriter serves no purpose. That is why you have to massively hack your grammar. – AUSTX_RJL Dec 27 '20 at 19:50
It may be possible to call TokenStreamRewriter.GetText() token-by-token, keeping track of all the context intervals and adding back the whitespace retrieved from walking the context... – AUSTX_RJL Dec 27 '20 at 20:27
@AUSTX_RJL Calling `TokenStream ts = rewriter.getTokenStream();` and then `for (int i=0; i – David Tonhofer Aug 12 '22 at 14:26
Mysteriously in the book [The Definitive ANTLR 4 Reference](https://pragprog.com/titles/tpantlr2/the-definitive-antlr-4-reference/), the example given on p.52 indicates that `TokenStreamRewriter.getText()` inserts appropriate whitespace, but it does not. Maybe something changed in ANTLR since the book came out (2013? time flies). The book needs a version 2 (and should switch to using Junit5 tests instead of the command line, it's so much more flexible but that's just by the way) – David Tonhofer Aug 12 '22 at 14:43
Mysteriously in the book [The Definitive ANTLR 4 Reference](https://pragprog.com/titles/tpantlr2/the-definitive-antlr-4-reference/), the example given on p.52 indicates that TokenStreamRewriter.getText() inserts appropriate whitespace, but it does not. The book needs a version 2 (and should switch to using Junit5 tests instead of the command line, it's so much more flexible but that's just by the way). Update: Ok, it talks about channels in the *subsequent* chapter. That's confusing. – David Tonhofer Aug 12 '22 at 14:59
