Code substitution for DSL using ANTLR

Question

The DSL I'm working on allows users to define a 'complete text substitution' variable. When parsing the code, we then need to look up the value of the variable and start parsing again from that code.

The substitution can be very simple (single constants) or entire statements or code blocks. This is a mock grammar which I hope illustrates my point.

grammar a;

entry
  : (set_variable
  | print_line)*
  ;

set_variable
  : 'SET' ID '=' STRING_CONSTANT ';'
  ;

print_line
  : 'PRINT' ID ';'
  ;

STRING_CONSTANT: '\'' ('\'\'' | ~('\''))* '\'' ;

ID: [a-z][a-zA-Z0-9_]* ;

VARIABLE: '&' ID;

BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;

Then the following statements, executed consecutively, should be valid:

SET foo = 'Hello world!';
PRINT foo;            

SET bar = 'foo;';
PRINT &bar                    // should be interpreted as 'PRINT foo;'

SET baz = 'PRINT foo; PRINT'; // one complete statement and one incomplete statement
&baz foo;                     // should be interpreted as 'PRINT foo; PRINT foo;'

Any time the & variable token is discovered, we immediately switch to interpreting the value of that variable instead. As above, this can mean that the code is set up in such a way that it is invalid on its own, full of half-statements that are only completed when the value is just right. The variables can be redefined at any point in the text.

Strictly speaking the current language definition doesn't disallow nesting &vars inside each other, but the current parsing doesn't handle this and I would not be upset if it wasn't allowed.

Currently I'm building an interpreter using a visitor, but this one I'm stuck on.

How can I build a lexer/parser/interpreter which will allow me to do this? Thanks for any help!

That's some nasty trickery to account for in your grammar. Are there any restrictions on where, and how many, `VARIABLE`s can occur in a single `entry`? I mean, is this allowed: `SET a = 'P'; SET b = 'R'; SET c = 'I'; SET d = 'N'; SET e = 'T'; SET f = ' '; SET g = ''''; SET h = 'ouch!'; SET i = ''''; SET j = ';'; &a&b&c&d&e&f&g&h&i&j` to eventually evaluate `PRINT 'ouch!';`? – Bart Kiers, Mar 13 '14 at 15:10
Yes, that would indeed be a valid statement :/ I doubt anyone ever uses it that way, but the application has been around for many years so you can never be sure what customers have done. The current implementation when reading in characters to make up tokens simply switches to reading from the variable value instead, but I don't know if/how that is compatible with ANTLR. — Trasvi, Mar 13 '14 at 15:31
I don't think there's an easy way to insert code/tokens during parsing. At least not with the provided API classes (you could of course implement your own `TokenStream` and feed that to the parser). — Bart Kiers, Mar 13 '14 at 22:16
@500-InternalServerError I believe that would raise an error. First the value of baz is set to "&baz". Then the text &baz is replaced by the value of baz, "&baz". Then we read in '&', which is not a valid symbol at that point. At least, that is how it currently works. — Trasvi, Mar 14 '14 at 05:00
Some solutions come to mind, though I have no idea whether they're possible with ANTLR: 1/ Can you, while parsing the grammar, throw away everything that's currently sitting in the lexer-to-parser "queue" (resetting the lexer input stream pointer) then inject text in the front of the lexer input stream? (cont) – paxdiablo, Mar 14 '14 at 05:23
2/ Can you run a simpler pre-parse over the input stream to process `set` and `&whatever` variables, constructing an input stream for ANTLR that has no `&whatever` bits left? (cont) – paxdiablo, Mar 14 '14 at 05:24
3/ Can you add restrictions that will disallow stupidities like `&`-vars that don't hold entire tokens, or things like `set bob = set; &bob xyzzy = plugh;` :-) — paxdiablo, Mar 14 '14 at 05:25

Trasvi · Accepted Answer · 2014-03-15T03:06:42.800

So I have found one solution to the issue. I think it could be better - as it potentially does a lot of array copying - but at least it works for now.

EDIT: I was wrong before, and my solution would consume ANY & that it found, including those in valid locations such as inside string constants. This seems like a better solution:

First, I extended the InputStream so that it is able to rewrite the input stream when a & is encountered. This unfortunately involves copying the array, which I can maybe resolve in the future:

MacroInputStream.java

    package preprocessor;

    import java.util.HashMap;

    import org.antlr.v4.runtime.ANTLRInputStream;

    public class MacroInputStream extends ANTLRInputStream {

      private HashMap<String, String> map;

      public MacroInputStream(String s, HashMap<String, String> map) {
        super(s);
        this.map = map;
      }

      // Replaces data[startIndex..stopIndex] (inclusive) with replaceText.
      public void rewrite(int startIndex, int stopIndex, String replaceText) {
        int length = stopIndex-startIndex+1;
        char[] replData = replaceText.toCharArray();
        if (replData.length == length) {
          // same length: overwrite in place, no reallocation needed
          for (int i = 0; i < length; i++) data[startIndex+i] = replData[i];
        } else {
          // different length: build a new array from prefix + replacement + suffix
          char[] newData = new char[data.length+replData.length-length];
          System.arraycopy(data, 0, newData, 0, startIndex);
          System.arraycopy(replData, 0, newData, startIndex, replData.length);
          System.arraycopy(data, stopIndex+1, newData, startIndex+replData.length, data.length-(stopIndex+1));
          data = newData;
          n = data.length; // keep the stream's character count in sync
        }
      }
    }
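
The index arithmetic in `rewrite` can be checked on its own, without ANTLR. The following is a minimal sketch under that assumption: `SpliceSketch` and `splice` are hypothetical names, not part of the answer's code, and the method returns a new array instead of mutating the stream's `data` field.

```java
final class SpliceSketch {

    /** Returns a copy of data with data[start..stop] (inclusive) replaced by repl. */
    static char[] splice(char[] data, int start, int stop, String repl) {
        int removed = stop - start + 1;
        char[] out = new char[data.length - removed + repl.length()];
        System.arraycopy(data, 0, out, 0, start);                // text before the variable
        repl.getChars(0, repl.length(), out, start);             // the substituted value
        System.arraycopy(data, stop + 1, out, start + repl.length(),
                         data.length - (stop + 1));              // text after the variable
        return out;
    }

    public static void main(String[] args) {
        // In "PRINT &bar" the reference "&bar" occupies indices 6..9.
        char[] src = "PRINT &bar".toCharArray();
        System.out.println(new String(splice(src, 6, 9, "foo;"))); // prints "PRINT foo;"
    }
}
```

Every substitution of a different length copies the whole buffer, which is the array-copying cost mentioned above.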

Secondly, I extended the Lexer so that when a VARIABLE token is encountered, the rewrite method above is called:

MacroGrammarLexer.java

package language;

import java.util.HashMap;

import org.antlr.v4.runtime.Token;

import preprocessor.MacroInputStream;

public class MacroGrammarLexer extends DSL_GrammarLexer {

  private HashMap<String, String> map;

  public MacroGrammarLexer(MacroInputStream input, HashMap<String, String> map) {
    super(input);
    this.map = map;
  }

  private MacroInputStream getInput() {
    return (MacroInputStream) _input;
  }

  @Override
  public Token nextToken() {
    Token t = super.nextToken();
    if (t.getType() == VARIABLE) {
      String value = map.get(t.getText().substring(1)); // strip the leading '&'
      if (value == null) {
        return t; // undefined variable: leave the token for the parser to reject
      }
      System.out.println("Encountered token " + t.getText()+" ===> rewriting!!!");
      getInput().rewrite(t.getStartIndex(), t.getStopIndex(), value);
      getInput().seek(t.getStartIndex()); // rewind to the start of the substituted text
      return super.nextToken(); // a nested &var in the value is not expanded again
    }
    return t;
  }

}

Lastly, I modified the generated parser to set the variables at the time of parsing:

DSL_GrammarParser.java

    ...
    ...
    HashMap<String, String> map;  // same map as before, passed as a new argument.
    ...
    ...

public final SetContext set() throws RecognitionException {
    SetContext _localctx = new SetContext(_ctx, getState());
    enterRule(_localctx, 130, RULE_set);
    try {
        enterOuterAlt(_localctx, 1);
        {
        String vname = null; String vval = null;              // set up variables
        setState(1215); match(SET);
        setState(1216); vname = variable_name().getText();    // set vname
        setState(1217); match(EQUALS);
        setState(1218); vval = string_constant().getText();   // set vval
        System.out.println("Found SET " + vname +" = " + vval+";");
        map.put(vname, vval);
        }
    }
    catch (RecognitionException re) {
        _localctx.exception = re;
        _errHandler.reportError(this, re);
        _errHandler.recover(this, re);
    }
    finally {
        exitRule();
    }
    return _localctx;
}
    ...
    ...

Unfortunately this method is final so this will make maintenance a bit more difficult, but it works for now.
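
To see the whole idea in isolation, here is a minimal sketch in plain Java with no ANTLR dependency. It is not the code above: `MacroDemo` and its methods are hypothetical names, the tokenizer is hand-written for the mock grammar from the question, and it assumes values are stored without their surrounding quotes (with `''` unescaped). As in the lexer above, a reference is expanded only at a token boundary, and a `&` found inside an already substituted value is treated as an error.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MacroDemo {
    private final StringBuilder buf;                      // mutable source text
    private int pos = 0;                                  // next unread character
    private int expandedUntil = 0;                        // end of the last substituted value
    private final Map<String, String> vars = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    MacroDemo(String source) { buf = new StringBuilder(source); }

    /** Reads one token; a &name at a token boundary is replaced by its value and rescanned. */
    private String nextToken() {
        while (pos < buf.length() && Character.isWhitespace(buf.charAt(pos))) pos++;
        if (pos >= buf.length()) return null;             // end of input
        int start = pos;
        char c = buf.charAt(pos);
        if (c == '&') {
            if (start < expandedUntil) throw new IllegalStateException("nested & in a substituted value");
            pos++;
            while (pos < buf.length() && isIdPart(buf.charAt(pos))) pos++;
            String name = buf.substring(start + 1, pos);
            String value = vars.get(name);
            if (value == null) throw new IllegalStateException("undefined &" + name);
            buf.replace(start, pos, value);               // splice the value into the source
            pos = start;                                  // rewind to the spliced text
            expandedUntil = start + value.length();
            return nextToken();
        }
        if (c == '\'') {                                  // string constant, '' escapes a quote
            pos++;
            while (pos < buf.length()) {
                if (buf.charAt(pos) != '\'') { pos++; continue; }
                if (pos + 1 < buf.length() && buf.charAt(pos + 1) == '\'') { pos += 2; continue; }
                pos++;                                    // closing quote
                return buf.substring(start, pos);
            }
            throw new IllegalStateException("unterminated string");
        }
        if (Character.isLetter(c)) {                      // keyword or identifier
            while (pos < buf.length() && isIdPart(buf.charAt(pos))) pos++;
            return buf.substring(start, pos);
        }
        pos++;
        return String.valueOf(c);                         // single character such as '=' or ';'
    }

    private void expect(String want) {
        String t = nextToken();
        if (!want.equals(t)) throw new IllegalStateException("expected " + want + " but got " + t);
    }

    /** Interprets the source and returns the printed values in order. */
    List<String> run() {
        for (String t = nextToken(); t != null; t = nextToken()) {
            if (t.equals("SET")) {
                String name = nextToken();
                expect("=");
                String literal = nextToken();
                if (literal == null || !literal.startsWith("'")) throw new IllegalStateException("expected a string");
                expect(";");
                vars.put(name, unquote(literal));         // stored before the next token is read
            } else if (t.equals("PRINT")) {
                String name = nextToken();
                expect(";");
                output.add(vars.get(name));
            } else {
                throw new IllegalStateException("unexpected token " + t);
            }
        }
        return output;
    }

    static String unquote(String literal) {
        return literal.substring(1, literal.length() - 1).replace("''", "'");
    }

    static boolean isIdPart(char ch) { return Character.isLetterOrDigit(ch) || ch == '_'; }

    public static void main(String[] args) {
        String src = "SET foo = 'Hello world!'; PRINT foo;\n"
                   + "SET bar = 'foo;'; PRINT &bar\n"
                   + "SET baz = 'PRINT foo; PRINT'; &baz foo;";
        new MacroDemo(src).run().forEach(System.out::println); // prints "Hello world!" four times
    }
}
```

The sketch never reads ahead, so a `SET` has always taken effect before the following `&` reference is tokenized. A token stream that buffers or looks ahead may lex a reference before the preceding `SET` has been processed, so that ordering is worth checking in the real ANTLR setup.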

score -1 · Answer 2 · edited May 23 '17 at 12:24

The standard pattern for handling your requirements is to implement a symbol table. The simplest form is a key:value store. In your visitor, add var declarations as they are encountered, and read out the values as var references are encountered.

As described, your DSL does not define a scoping requirement on the variables declared. If you do require scoped variables, then use a stack of key:value stores, pushing and popping on scope entry and exit.

See this related StackOverflow answer.
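
A stack of key:value stores along those lines could look like the following minimal sketch. `ScopedSymbolTable` and its method names are hypothetical, and values are assumed to be plain strings; a visitor would call `enterScope`/`exitScope` on block entry and exit.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class ScopedSymbolTable {
    // Innermost scope is at the head of the deque.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    ScopedSymbolTable() { scopes.push(new HashMap<>()); }   // global scope

    void enterScope() { scopes.push(new HashMap<>()); }

    void exitScope() {
        if (scopes.size() == 1) throw new IllegalStateException("cannot pop the global scope");
        scopes.pop();
    }

    /** Declares or redefines a variable in the innermost scope. */
    void define(String name, String value) { scopes.peek().put(name, value); }

    /** Looks a name up from the innermost scope outwards; returns null if undefined. */
    String resolve(String name) {
        for (Map<String, String> scope : scopes) {           // iterates most recently pushed first
            if (scope.containsKey(name)) return scope.get(name);
        }
        return null;
    }

    public static void main(String[] args) {
        ScopedSymbolTable table = new ScopedSymbolTable();
        table.define("foo", "Hello world!");
        table.enterScope();
        table.define("foo", "shadowed");
        System.out.println(table.resolve("foo"));            // prints "shadowed"
        table.exitScope();
        System.out.println(table.resolve("foo"));            // prints "Hello world!"
    }
}
```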

Separately, since your strings may contain commands, you can simply parse the contents as part of your initial parse. That is, expand your grammar with a rule that includes the full set of valid contents:

set_variable
   : 'SET' ID '=' stringLiteral ';'
   ;

stringLiteral: 
   Quote Quote? ( 
     (    set_variable
        | print_line
        | VARIABLE
        | ID
     )
     | STRING_CONSTANT  // redefine without the quotes
   )
   Quote
   ;

answered Mar 14 '14 at 01:02 by GRosenberg · edited May 23 '17 at 12:24 by Community

I don't think that this would work. There is another system of variables which is implemented as you describe, but the &VAR system is a code replacement, more like a preprocessor macro, which can contain malformed statements. So, `SET foo = '''BAR'; SET baz = &foo';` would be valid statements, but I couldn't possibly account for it by redefining stringLiteral as you describe. I essentially need &foo to tokenize as `'` + `BAR` instead. – Trasvi Mar 14 '14 at 05:23
To handle malformed statements, extend ANTLRErrorListener and ANTLRErrorStrategy. That will allow you to intelligently handle the malformed statement directly within the parsing operation. If two sequential quotes define an escaped quote, then list it as a valid subrule as shown in the edited answer. – GRosenberg Mar 14 '14 at 18:38
