How to add back comments/whitespace in a translator using ANTLR4's visitor model

Question

I'm currently writing a TSQL (Sybase/Microsoft SQL) to MySQL translator using the ANTLR4 visitor approach.

I'm able to push comments and whitespaces to different channels so that I can use that information later.

What's not clear is:

1. How do I get the data back?
2. More importantly, how do I plug the comments and whitespace back into my translated MySQL code?

Re: #1, this seems to work to get the list of all tokens including the comments/whitespaces:

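import java.util.List;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.Token;

// TSqlLexer is generated from the grammar; CaseChangingCharStream is a helper class, not part of the runtime.
public static List<Token> getHiddenTokensFromString(String sqlIn, int hiddenChannel) {
    CharStream charStream = CharStreams.fromString(sqlIn);
    CaseChangingCharStream upper = new CaseChangingCharStream(charStream, true);
    TSqlLexer lexer = new TSqlLexer(upper);
    CommonTokenStream commonTokenStream = new CommonTokenStream(lexer, hiddenChannel);
    commonTokenStream.fill(); // lex the whole input up to EOF
    // getTokens() returns every token on every channel, not only those on hiddenChannel.
    return commonTokenStream.getTokens();
}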

Re: #2, what makes this particularly challenging is that, as part of the translation, some lines of SQL have to be moved around, some removed, and some added.

Any help will be greatly appreciated.

Thanks.

In the ANTLR runtime, CommonTokenStream, which is a subclass of [BufferedTokenStream](https://www.antlr.org/api/Java/org/antlr/v4/runtime/BufferedTokenStream.html), has a few useful methods. Check out getHiddenTokensToLeft() and getHiddenTokensToRight(). If you have a parse tree node, get its source interval and pass the token index at its left or right edge to whichever getHidden*() method you want to use. That gives you a list of tokens whose text you can use to reconstruct the inter-token whitespace and comments. – kaby76, Jul 19 '20 at 17:48
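A minimal sketch of what that lookup does. It is written against a hypothetical stand-in class `Tok` rather than the ANTLR runtime, so it runs without a generated lexer; the class names, the channel numbering (0 = default, 1 = hidden) and the sample input are assumptions made for illustration. In real code the same result should come from `tokens.getHiddenTokensToLeft(ctx.getStart().getTokenIndex())` and `tokens.getHiddenTokensToRight(ctx.getStop().getTokenIndex())`, which return null when no hidden token is adjacent.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an ANTLR Token: only the fields this sketch needs.
final class Tok {
    final int channel;   // 0 = default channel, anything else is hidden
    final String text;
    Tok(int channel, String text) { this.channel = channel; this.text = text; }
}

final class HiddenNeighbours {
    // Text of the run of hidden tokens directly before tokens.get(index).
    static String hiddenToLeft(List<Tok> tokens, int index) {
        int i = index - 1;
        while (i >= 0 && tokens.get(i).channel != 0) i--;   // walk back to the previous visible token
        StringBuilder sb = new StringBuilder();
        for (int j = i + 1; j < index; j++) sb.append(tokens.get(j).text);
        return sb.toString();
    }

    // Text of the run of hidden tokens directly after tokens.get(index).
    static String hiddenToRight(List<Tok> tokens, int index) {
        StringBuilder sb = new StringBuilder();
        for (int j = index + 1; j < tokens.size() && tokens.get(j).channel != 0; j++) {
            sb.append(tokens.get(j).text);
        }
        return sb.toString();
    }

    // Token list a lexer might produce for: "-- total\nSELECT 1 /* one */;"
    static List<Tok> sample() {
        return Arrays.asList(
            new Tok(1, "-- total"), new Tok(1, "\n"),
            new Tok(0, "SELECT"), new Tok(1, " "),
            new Tok(0, "1"), new Tok(1, " "), new Tok(1, "/* one */"),
            new Tok(0, ";"));
    }
}
```

With the sample list, `hiddenToLeft(sample(), 2)` yields the line comment plus its newline, and `hiddenToRight(sample(), 4)` yields the space and block comment that follow the literal `1`.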

score 2 · Answer 1 · answered Jul 20 '20 at 06:44

The ANTLR4 lexer creates a sequence of tokens, each with an index (a running number). As long as a token was not skipped by the lexer, it remains available for inspection once the parsing step is done, regardless of its channel (the channel is just a numeric property on the token).

So, given a token you want to translate, get its index and then ask the token stream for the tokens with the next lower or next higher index. These are usually the hidden whitespace tokens.

Once you have the whitespace token, use its start and stop index to get the original text from the char stream. Since you know where you are in the translation process at that point, it should be easy to know where to insert the original text.
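A minimal sketch of that idea, under stated assumptions: `SrcTok` is a hypothetical stand-in for an ANTLR token, with inclusive start/stop character offsets like those reported by `Token.getStartIndex()` and `Token.getStopIndex()`. The source string, the channel numbers and the `GETDATE` to `NOW` rule are made up for illustration. With the real runtime, `tokens.get(i)` would fetch a token by index and `token.getText()` would return its original text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an ANTLR Token; start/stop are inclusive character offsets.
final class SrcTok {
    final int channel, start, stop;
    SrcTok(int channel, int start, int stop) {
        this.channel = channel; this.start = start; this.stop = stop;
    }
}

final class NeighbourCopy {
    static final String SOURCE = "SELECT  GETDATE() -- ts";

    // Token layout a lexer might produce for SOURCE (channel 1 = hidden).
    static List<SrcTok> lex() {
        String[] pieces = {"SELECT", "  ", "GETDATE", "(", ")", " ", "-- ts"};
        int[] channels  = {0, 1, 0, 0, 0, 1, 1};
        List<SrcTok> out = new ArrayList<>();
        int pos = 0;
        for (int i = 0; i < pieces.length; i++) {
            out.add(new SrcTok(channels[i], pos, pos + pieces[i].length() - 1));
            pos += pieces[i].length();
        }
        return out;
    }

    // Original text of a token, read back from the character data.
    static String original(SrcTok t) {
        return SOURCE.substring(t.start, t.stop + 1);
    }

    // Toy translation rule, for illustration only.
    static String translate(String text) {
        return text.equals("GETDATE") ? "NOW" : text;
    }

    static String run() {
        List<SrcTok> toks = lex();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < toks.size(); i++) {
            if (toks.get(i).channel != 0) continue;   // hidden tokens are copied via their visible neighbour
            out.append(translate(original(toks.get(i))));
            // Increment the index to reach the hidden tokens that follow, and copy them verbatim.
            for (int j = i + 1; j < toks.size() && toks.get(j).channel != 0; j++) {
                out.append(original(toks.get(j)));
            }
        }
        return out.toString();
    }
}
```

`NeighbourCopy.run()` returns `SELECT  NOW() -- ts`: the double space and the trailing comment survive the translation. Hidden tokens that precede the first visible token are not handled here and would need a lookup to the left as well.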

answered Jul 20 '20 at 06:44

Mike Lischke


Thanks, Mike. I need to work on another part of the translator but I will take another shot at putting back comments/whitespaces using your approach when I'm done. – Jerome Provensal Jul 20 '20 at 23:07
Hi Mike, I'm finally back to this. I understand how to get the list of tokens and how to get the ones associated with the rule I'm visiting, but it's not clear to me where in the code the best place(s) are to plug the comments/whitespace back in. What do you mean by "the next smaller/higher index" and what's the method to get them? Thanks! – Jerome Provensal Jul 31 '20 at 18:08
The next smaller or higher index is a pretty simple thing: increment or decrement the index number you got for your token and then use the token stream to get the token for those indices. Check their type to see what they are, but they really should be the hidden whitespaces as defined in the grammar. Now get their original source code and insert that in the target together with your translated code. – Mike Lischke Aug 01 '20 at 08:34
Thanks, Mike. I think I'm doing what you are suggesting but as I'm moving things around (for example MySQL requires all its variables to be declared first before any other operation can take place) comments are left behind. I was moved to another project for another week or so. I'll revisit the problem when I get back to it. – Jerome Provensal Aug 06 '20 at 20:22
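One possible way to keep comments from being left behind when statements move is to attach the hidden text in front of each statement to that statement before reordering, so the two travel as a unit. This is only a sketch under stated assumptions, not something proposed in the thread: `Stmt` and `DeclareHoister` are hypothetical names, the leading text is assumed to have been collected from the hidden tokens to the left of each statement's first token, and a `DECLARE` prefix check stands in for whatever the visitor actually knows about the statement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical unit of output: a translated statement plus the comments/whitespace that preceded it.
final class Stmt {
    final String leading;  // original hidden text found before the statement
    final String sql;      // translated statement text
    Stmt(String leading, String sql) { this.leading = leading; this.sql = sql; }
}

final class DeclareHoister {
    // Stable partition: declarations first, everything else after, original order kept within each group.
    static List<Stmt> declarationsFirst(List<Stmt> in) {
        List<Stmt> decls = new ArrayList<>();
        List<Stmt> rest = new ArrayList<>();
        for (Stmt s : in) {
            (s.sql.startsWith("DECLARE") ? decls : rest).add(s);
        }
        decls.addAll(rest);
        return decls;
    }

    // Emit each statement together with the comment text attached to it.
    static String render(List<Stmt> stmts) {
        StringBuilder sb = new StringBuilder();
        for (Stmt s : stmts) {
            sb.append(s.leading).append(s.sql).append('\n');
        }
        return sb.toString();
    }
}
```

Given a `SET` statement preceded by `-- reset counter` and a later `DECLARE` preceded by `-- loop counter`, rendering the reordered list puts the `DECLARE` first with its own comment still directly above it. Comments that describe a whole block rather than a single statement would need a different attachment rule.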
