ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

Question

I am trying to preprocess my C++ source files with ANTLR. I would like to write out each input file with all of the original whitespace formatting preserved, while inserting some new source code of my own at the appropriate locations.

I know preserving WS requires this lexer rule:

WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};

With this, my parser rules would have a $text attribute containing all the hidden WS. The problem is that, for any parser rule, its $text attribute only includes the input text starting from the position of the first token matched by the rule. For example, if this is my input (note the formatting WS before and in between the tokens):

    line   1;     line   2;

And, if I have 2 separate parser rules matching

"line   1;"

and

"line   2;"

above separately but not the whole line:

"    line   1;     line   2;"

, then the leading WS and the WS between "line 1" and "line 2" are lost (not accessible from any of my rules).

What should I do to preserve ALL the whitespace while still allowing my parser rules to determine when to add new code at the appropriate locations?

EDIT

Let's say that whenever my code contains a call to function(1), with 1 as the parameter and not something else, the preprocessor adds a call to extraFunction() before it:

void myFunction() {
   function();
   function(1);
}

Becomes:

void myFunction() {
   function();
   extraFunction();
   function(1);
}

This preprocessed output should remain human-readable, as people will continue coding on it. For this simple example, a text editor could handle it, but there are more complicated cases that justify the use of ANTLR.

Your question can now only be answered by telling you to remove `{$channel=HIDDEN;}`, but I realize that by doing so, most of your parser rules will then need to be littered with optional `WS` tokens: something that is most probably not a workable situation. Perhaps you should explain what pre-processing you're performing on your C++ sources: perhaps I, or someone else, can suggest another way to go about this. — Bart Kiers, Sep 16 '11 at 12:23
My preprocessor needs to understand C++ grammar, say for adding a function call before a specific function call, etc. Is it possible to add both *lexer* actions AND *parser* actions that print output at the same time? Lexer actions would print all the raw input they receive, while parser actions would print what they want to output at various input positions. I tried this but, probably due to lookahead, my parser actions' output is printed one token after the corresponding input. Is it possible to "synchronize" the reading of the input stream? — JavaMan, Sep 16 '11 at 12:37
Could you add some concrete examples of your C++ sources (example input, and example output)? — Bart Kiers, Sep 16 '11 at 12:39
You want to preserve the *exact* whitespacing? Because you are going to inspect it? (Doesn't sound like it). I'm guessing you really want to preserve the *layout* (different whitespace might be OK if columns are preserved) of the original code so the programmers don't hate you after you've modified it. But then they'll hate you for having expanded the preprocessor directives, which also loses the comments. Can you explain your rationale for preserving whitespace better, and why programmers will accept your preprocessed code? — Ira Baxter, Sep 16 '11 at 13:53
Yes, I mean preserving the layout only. My grammar is just a very small subset of C++ that ignores #include, comments and almost everything else (because it can assume the input is already a valid .cpp file) except those things I would like to understand (a function body as a list of semicolon-terminated identifiers). — JavaMan, Sep 16 '11 at 14:17
Perhaps all that is needed is for you to know when to indent? Keep track of the scopes, so that you can indent the correct number of times when injecting code into the layout. — beefyhalo, Sep 16 '11 at 16:12
@JavaMan: You didn't address my question as to why you were preserving whitespace, after losing macros, preprocessor conditionals and comments. Either the end programmers are going to see the code (and will object if you lose them), or they won't, in which case what you do to the layout as you modify the code doesn't matter. — Ira Baxter, Sep 16 '11 at 19:25
Maybe I shouldn't say preserving whitespace. Rather, it is to preserve everything in the hidden channel. Say, if I want to auto generate comments for all the functions (number and type of parameters, return values, etc), then the generated source file should have all the existing comments together with my auto generated comments. — JavaMan, Sep 17 '11 at 11:27
@JavaMan: OK, you want to preserve comments. Earlier you said you wanted to preserve layout; is that answer still valid? What about expanded preprocessor conditionals and macros (they're not in the "hidden channel")? You've avoided answering the question about whether you think programmers will accept the code with these removed. — Ira Baxter, Sep 18 '11 at 11:05
@JavaMan: Not an ANTLR answer, but maybe this is what you are trying to do: "Branch Coverage for Arbitrary Languages Made Easy" www.semdesigns.com/Company/Publications/TestCoverage.pdf — Ira Baxter, Sep 18 '11 at 11:10

score 2 · Accepted Answer · answered Sep 16 '11 at 12:51

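Another solution, though maybe also not a very practical one: you can collect the whitespace tokens by walking backwards from each rule's start token, something like this untested sketch:

grammar T;

@members {
    // Prints the hidden-channel tokens (whitespace, comments) that sit
    // immediately before the given token, in their original order.
    public void printWhitespaceBetweenRules(Token start) {
        StringBuilder hidden = new StringBuilder();
        int index = start.getTokenIndex() - 1;

        // Walk backwards until the first token that is not on the hidden channel.
        while (index >= 0) {
            Token token = input.get(index);
            if (token.getChannel() != Token.HIDDEN_CHANNEL) break;
            hidden.insert(0, token.getText()); // prepend to keep source order
            index--;
        }
        System.out.print(hidden);
    }
}

line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};

But you would still need to add the action to every rule.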
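To see the backward scan in isolation, here is a minimal Java sketch that does not depend on the ANTLR runtime. The Tok class and the hiddenBefore and ruleText helpers are hypothetical stand-ins invented for this illustration (Tok for an ANTLR token, ruleText for a rule's $text); only the backward loop mirrors the idea above.

```java
import java.util.Arrays;
import java.util.List;

public class HiddenTokenDemo {
    // Hypothetical stand-in for an ANTLR token: its text and whether it is on the hidden channel.
    public static final class Tok {
        final String text;
        final boolean hidden;
        public Tok(String text, boolean hidden) { this.text = text; this.hidden = hidden; }
    }

    // Collects the hidden tokens immediately before tokens[start], in source order.
    public static String hiddenBefore(List<Tok> tokens, int start) {
        StringBuilder sb = new StringBuilder();
        for (int i = start - 1; i >= 0 && tokens.get(i).hidden; i--) {
            sb.insert(0, tokens.get(i).text); // prepend, because we walk backwards
        }
        return sb.toString();
    }

    // Text of tokens[start..stop], hidden tokens included (roughly what $text gives for a rule).
    public static String ruleText(List<Tok> tokens, int start, int stop) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i <= stop; i++) {
            sb.append(tokens.get(i).text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Token stream for the question's input: "    line   1;     line   2;"
        List<Tok> tokens = Arrays.asList(
            new Tok("    ", true), new Tok("line", false), new Tok("   ", true),
            new Tok("1", false), new Tok(";", false),
            new Tok("     ", true), new Tok("line", false), new Tok("   ", true),
            new Tok("2", false), new Tok(";", false));

        // Two "rules" covering tokens 1..4 and 6..9, each prefixed with its leading hidden text.
        String out = hiddenBefore(tokens, 1) + ruleText(tokens, 1, 4)
                   + hiddenBefore(tokens, 6) + ruleText(tokens, 6, 9);
        System.out.println("[" + out + "]");
    }
}
```

Prefixing each rule's text with the hidden tokens in front of its start token reproduces the input line in full, including the leading whitespace and the gap between the two statements. Hidden tokens after the last rule (for example a trailing comment) would still need a separate pass.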

Sonson

Is this solution restricted to Java as the target lang? I may need to change to C in the future although Java is what I am using now. – JavaMan Sep 16 '11 at 14:38
I also didn't test it, but it looks like this might work: looking backwards from the current token to see how many indented spaces it has. Nicely thought of! – Bart Kiers Sep 16 '11 at 17:57
Some people call the backward number of whitespaces to the line start, the "current column number". ANTLR doesn't collect this as a matter of course on a token? [Messy complication: if there's a tab character, what column number does it start in? end in?] – Ira Baxter Sep 16 '11 at 20:09
The solution does work and is what I need (I'd probably insert the text from each hidden channel token into a StringBuffer and combine it with $text). Now I am curious why there is no standard attribute to retrieve these hidden channel tokens. Isn't the whole point of the hidden channel to allow the parser to access the whitespace/comments? Currently we can only use $text to retrieve a multi-line comment embedded within a statement matching one of our rules, but not one lying in between 2 statements. – JavaMan Sep 17 '11 at 09:24
Worse, single-line comments are most likely not accessible at all, since they are unlikely to be embedded within the tokens matched by any rule. So the whole idea of the hidden channel does not seem to work currently without custom code. – JavaMan Sep 17 '11 at 09:32

score 1 · Answer 2 · answered Sep 16 '11 at 12:17

I guess one solution is to keep the WS tokens on the default channel by removing the {$channel=HIDDEN;} action. This will give you access to the WS tokens in your parser.

beefyhalo

Although you are correct, it would mean that most (maybe all?) production rules would then need to be littered with optional `WS` tokens, causing the grammar to become an utter mess. So this is not likely to be a proper solution... – Bart Kiers Sep 16 '11 at 12:26

Bart Kiers · Answer 3 · 2011-09-16T19:09:36.797

Here's another way to solve it (at least the example you posted).

So you want to replace ...function(1) with ...extraFunction();\n...function(1), where the dots are the indent and \n is a line break.

What you could do is match:

Function1
  :  Spaces 'function' Spaces '(' Spaces '1' Spaces ')' 
  ;

fragment Spaces
  :  (' ' | '\t')*
  ;

and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:

'function()'

(without the 1 as a parameter)

or:

'    x...'

(indents not followed by the f from function)

So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.

You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.

A little demo:

grammar T;

parse
  :  (t=. {System.out.print($t.text);})* EOF
  ;

Function1
  :  indent=Spaces 
     ( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
                                    | ~'1' // do nothing if something other than `1` occurs
                                    )
     | '"' ~('"' | '\r' | '\n')* '"'       // do nothing in case of a string literal
     | '/*' .* '*/'                        // do nothing in case of a multi-line comment
     | '//' ~('\r' | '\n')*                // do nothing in case of a single-line comment
     | ~'f'                                // do nothing in case of a char other than 'f' is seen
     )
  ;

OtherChar
  :  . // a "fall-through" rule: it will match anything if none of the above matched
  ;

fragment Spaces
  :  (' ' | '\t')* // fragment rules are only used inside other lexer rules
  ;

You can test it with the following class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = 
        "/*                      \n" +
        "  function(1)           \n" +
        "*/                      \n" +
        "void myFunction() {     \n" +
        "   s = \"function(1)\"; \n" + 
        "   function();          \n" + 
        "   function(1);         \n" + 
        "}                       \n";
    System.out.println(source);
    System.out.println("---------------------------------");
    TLexer lexer = new TLexer(new ANTLRStringStream(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

If you generate the lexer and parser, compile everything, and run this Main class, you will see the following printed to the console:

bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main

/*                      
  function(1)           
*/                      
void myFunction() {     
   s = "function(1)"; 
   function();          
   function(1);         
}                       

---------------------------------
/*                      
  function(1)           
*/                      
void myFunction() {     
   s = "function(1)"; 
   function();          
   extraFunction();
   function(1);         
}

I'm sure it's not foolproof (I didn't account for char literals, for one), but this could be a start to solving this, IMO.
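For comparison, the same skip-or-rewrite idea can be sketched in plain Java with java.util.regex, without generating a lexer. This is only an illustration under assumptions: the class name ExtraFunctionInserter and the method preprocess are made up here, and the pattern shares the grammar's limitations (no char literals, no escaped quotes inside strings, no check that function is a whole identifier).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtraFunctionInserter {
    // Alternatives are tried left to right at each position, much like the lexer
    // alternatives above: comments and string literals are matched first, so any
    // function(1) inside them is copied through untouched.
    private static final Pattern TOKEN = Pattern.compile(
          "/\\*.*?\\*/"                                      // multi-line comment
        + "|//[^\\r\\n]*"                                    // single-line comment
        + "|\"[^\"\\r\\n]*\""                                // string literal (no escape handling)
        + "|([ \\t]*)function[ \\t]*\\([ \\t]*1[ \\t]*\\)",  // group 1 = indent before function(1)
        Pattern.DOTALL);

    public static String preprocess(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Only the last alternative sets group 1; everything else is re-emitted as is.
            String replacement = (m.group(1) == null)
                ? m.group()
                : m.group(1) + "extraFunction();\n" + m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String source =
              "/*\n"
            + "  function(1)\n"
            + "*/\n"
            + "void myFunction() {\n"
            + "   s = \"function(1)\";\n"
            + "   function();\n"
            + "   function(1);\n"
            + "}\n";
        System.out.print(preprocess(source));
    }
}
```

Text that matches none of the alternatives is never consumed by the matcher, so appendReplacement and appendTail copy it verbatim, which is what keeps the original layout intact. This plays the role of the OtherChar fall-through rule in the grammar.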
