Switch CommonTokenStream to ignore or enable Whitespace

Question

My original grammar uses the skip command to ignore whitespaces in the parsing process.

WS      :   [ \t]+ ->  skip ;

However, for refactoring methods I need to send whitespace tokens to a hidden channel so that I can use the TokenStreamRewriter, according to this recipe: ANTLR4: TokenStreamRewriter output doesn't have proper format (removes whitespaces)

WS      :   [ \t]+ ->  channel(HIDDEN);

The problem now is that the parser recognizes whitespace as tokens, which I want to avoid in the default parsing process.

Is it possible to switch between two different implementations of the same rule, depending on whether I am running the regular parsing process or the parsing process for refactoring methods (with the same grammar)?

Do I need semantic predicates for this? Or is there a method available in the CommonTokenStream to skip or enable whitespace?

Mike Cargal · Accepted Answer · 2016-01-21T14:05:23.957

I'm not really sure what is causing your problem. Your expected behavior is correct.

WS      :   [ \t]+ ->  channel(HIDDEN);

will move those tokens to a channel that is not processed by the parser. You do not need semantic predicates, or any special calls on CommonTokenStream to make this happen.

This is what I do in my grammar and WS is not seen by the parser (I have a slightly different WS rule, but nothing that should make a difference).

The lexer (aka tokenizer) runs independently of the parser (and before the parser), so the parser can't do anything to impact how the lexer does its job (for example, which channel a token is placed on).
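The channel idea can be sketched in plain Java. This is a simplified model, not the ANTLR API: the `Tok` class, the helper methods, and the sample stream are all hypothetical and exist only for illustration. It shows one token list serving two consumers: a parser-style view that drops hidden tokens, and a rewriter-style view that prints everything back.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model, NOT the ANTLR API: Tok and all helpers here are hypothetical.
class ChannelSketch {
    static final int DEFAULT_CHANNEL = 0; // tokens the parser consumes
    static final int HIDDEN_CHANNEL = 1;  // tokens kept in the stream but not parsed

    static final class Tok {
        final String text;
        final int channel;
        Tok(String text, int channel) { this.text = text; this.channel = channel; }
    }

    // What a channel-aware parser is handed: default-channel tokens only.
    static List<String> parserView(List<Tok> stream) {
        List<String> seen = new ArrayList<>();
        for (Tok t : stream) {
            if (t.channel == DEFAULT_CHANNEL) seen.add(t.text);
        }
        return seen;
    }

    // What a rewriter can print back: every token, hidden ones included.
    static String fullText(List<Tok> stream) {
        StringBuilder sb = new StringBuilder();
        for (Tok t : stream) sb.append(t.text);
        return sb.toString();
    }

    // Token stream for the input "x = 2" with whitespace on the hidden channel.
    static List<Tok> sample() {
        return List.of(
            new Tok("x", DEFAULT_CHANNEL), new Tok(" ", HIDDEN_CHANNEL),
            new Tok("=", DEFAULT_CHANNEL), new Tok(" ", HIDDEN_CHANNEL),
            new Tok("2", DEFAULT_CHANNEL));
    }
}
```

With skip, the whitespace tokens would never enter the list at all, so the second view could not reproduce the original spacing.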

You may also want to take a look at the following method on your TokenStream:

public List<Token> getTokens(int start, int stop, int ttype)

With that method you can pull out a list of your comment tokens within the start and stop token indices, by supplying the token type of your comment token as the third parameter.
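The filtering described above can be sketched as follows. This is again a plain-Java model under stated assumptions, not ANTLR code: `Tok`, the token type constants, and both helpers are hypothetical. The first helper mirrors the idea of selecting tokens of one type within an index range; the second shows one way to avoid fixed index arithmetic such as `start + 1` when hidden tokens sit between the tokens of interest.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model, NOT the ANTLR API: Tok, the type constants and the helpers are hypothetical.
class TokenLookupSketch {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;
    static final int ID = 1, ASSIGN = 2, INT = 3, WS = 4; // made-up token types

    static final class Tok {
        final int type;
        final String text;
        final int channel;
        Tok(int type, String text, int channel) {
            this.type = type; this.text = text; this.channel = channel;
        }
    }

    // Tokens of one type between two stream indices (inclusive), whatever their channel.
    static List<Tok> tokensOfType(List<Tok> stream, int start, int stop, int type) {
        List<Tok> out = new ArrayList<>();
        for (int i = Math.max(start, 0); i <= stop && i < stream.size(); i++) {
            if (stream.get(i).type == type) out.add(stream.get(i));
        }
        return out;
    }

    // Index of the first default-channel token at or after 'from', or -1 if there is none.
    static int nextOnDefaultChannel(List<Tok> stream, int from) {
        for (int i = Math.max(from, 0); i < stream.size(); i++) {
            if (stream.get(i).channel == DEFAULT_CHANNEL) return i;
        }
        return -1;
    }

    // Token stream for "x = 2" with whitespace on the hidden channel.
    static List<Tok> sample() {
        return List.of(
            new Tok(ID, "x", DEFAULT_CHANNEL), new Tok(WS, " ", HIDDEN_CHANNEL),
            new Tok(ASSIGN, "=", DEFAULT_CHANNEL), new Tok(WS, " ", HIDDEN_CHANNEL),
            new Tok(INT, "2", DEFAULT_CHANNEL));
    }
}
```

In the sample stream, index `start + 1` holds a whitespace token, while `nextOnDefaultChannel(stream, start + 1)` lands on the `=` token regardless of how much hidden whitespace precedes it.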

edited Jan 21 '16 at 14:05

answered Jan 19 '16 at 21:12


Hello Mike thanks for the answer. At the moment I skip the Whitespace which gives me the possibility to detect functions, vars, etc. for my source analysis. This works fine. However when I would like to use the TokenStreamRewriter (with Whitespace skip in the grammar) I don't get the whitespaces which I need, e.g. to extract parse and replace the source. So in my refactoring process I need the channel(HIDDEN) grammar in the lexer to do this. So my question is how can I switch between both states in the lexer. Else I have to create two grammars. – Marcel Jan 20 '16 at 09:08
I just wanted to add that I use the CommonTokenStream to extract positions of var and function assignments. The rule channel(HIDDEN) includes whitespaces in the token stream (which I wanted to avoid in my general parsing process but not in my refactoring parsings). – Marcel Jan 20 '16 at 09:13
You might want to hit up section 4.5 of the ANTLR book (the section titled "Rewriting the Input Stream"). Using a TokenStreamRewriter, your parser will still not "see" the tokens on the hidden channel, but you can use the methods on the TokenStreamRewriter (which does keep up with the hidden channel) to modify the stream and write it back out. After calling your modifier methods, just call getText() on your rewriter and you'll get back the full input stream with your modifications. It will include content from the HIDDEN channel. – Mike Cargal Jan 20 '16 at 10:24
But will the TokenStreamRewriter also see the skipped tokens? Because in my primary grammar I need the lexer expression: WS : [ \t]+ -> skip ; for the regular parsing. The hidden channel worked for me but only for the refactor parsings. – Marcel Jan 20 '16 at 11:27
Skip would work if I didn't use the CommonTokenStream in the primary parsing process (to access the tokens which I split for the detection of vars and functions). Something like: Interval sourceInterval = ctx.getSourceInterval(); int start = sourceInterval.a; Token assign = tokens.get(start + 2); – Marcel Jan 20 '16 at 11:35
TokenStreamRewriter keeps up with tokens on the hidden channel (and will include them in the results of getText()), but will not hand those tokens to the parser when it asks for tokens, thereby making them "hidden" from the parser rules while still allowing it to keep up with them for writing the modified stream back out. I've used them this way. Tokens on the hidden channel don't interfere with parse rules, and I was able to use exactly the same grammar to write a very simple routine to modify my source (it maintains all the content from the hidden channel). – Mike Cargal Jan 20 '16 at 11:45
I don't see enough information to make a suggestion as to why the hidden channel didn't work for you in your normal parse. BTW, tokens on the hidden channel shouldn't affect your refactor parse either. Your parse rules should match as if they were not in the token stream. You just have access to them if you need them (always true) and the TokenStreamRewriter gives you convenience methods to modify the stream for retrieval using getText(). If you think tokens on the hidden channel are being evaluated by your parser, then you have some detail wrong. The excerpt of grammar you shared looks right. – Mike Cargal Jan 20 '16 at 11:56
Hello Mike. Thanks so far for your suggestions. However as far as I understood when I try to get a specific token in my listener for example from the following grammar excerpt: expr ">" expr then I could get the first token in my listener with ctx.getExpr(0) and the last token, e.g., with ctx.getExpr(1). However the token in between I can get by calculating the index from the first token and then I use the CommonTokenStream to get the second token ">" in the rule: Interval sourceInterval = ctx.getSourceInterval(); int start = sourceInterval.a; CommonTokenStream.get(start+1); – Marcel Jan 21 '16 at 08:28
This, however, doesn't work when there are whitespaces in between, because I use the skip command (preferred) in the lexer and I use the CommonTokenStream. The only way I can think of to get the ">" token in my listener with the channel(HIDDEN) rule would be not to use the CommonTokenStream and instead access the token with ctx.getParent().get(1). That should work. So I have to rewrite my listeners. However, do you know another method to access the middle token from this specific rule as an example? – Marcel Jan 21 '16 at 08:28
I meant ctx.getParent().getChild(1) – Marcel Jan 21 '16 at 08:57
In the listener, parser.getTokenStream().get(start+1) won't work because of the whitespace in the TokenStream (a consequence of using channel(HIDDEN) as the rule) if I try to detect the '=' token of a var assignment such as x = 2, matched by the parser rule: expr "=" expr – Marcel Jan 21 '16 at 12:25
I've amended my answer with a reference to a method on the token stream that should do what you need. (Per the comments it will look at both on and off channels, so it'll see your comment tokens, and you can restrict your list to just the comments.) – Mike Cargal Jan 21 '16 at 14:06
Hello Mike. Thanks for the time and patience to answer this question. I will mark the edited answer as correct because it's a solution to extract the specified tokens from the stream. I will try to get the opposite Tokens (without the Whitespace). – Marcel Jan 21 '16 at 16:15
