How to generate arbitrary instances of a language given its concrete syntax in Rascal?

Question

Given the concrete syntax of a language, I would like to define a function "instance" with signature str (type[&T]) that could be called with the reified type of the syntax and return a valid instance of the language.

For example, with this syntax:

lexical IntegerLiteral = [0-9]+;           

start syntax Exp            
  = IntegerLiteral          
  | bracket "(" Exp ")"     
  > left Exp "*" Exp        
  > left Exp "+" Exp        
  ;

A valid return of instance(#Exp) could be "1+(2*3)".

The reified type of a concrete syntax definition does contain information about the productions, but I am not sure whether this approach is better than a dedicated data structure. Any pointers on how I could implement it?

Jurgen Vinju · Accepted Answer · 2021-04-25T09:36:48.613

The most natural thing is to use the Tree data type from the ParseTree module in the standard library. It is the format that the parser produces, but you can also construct it yourself. To get a string from a tree, simply interpolate it into a string like so:

str s = "<myTree>";

A relatively complete random tree generator can be found here: https://github.com/cwi-swat/drambiguity/blob/master/src/GenerateTrees.rsc

The core of the implementation is this:

Tree randomChar(range(int min, int max)) = char(arbInt(max + 1 - min) + min);

Tree randomTree(type[Tree] gr) 
  = randomTree(gr.symbol, 0, toMap({ <s, p> | s <- gr.definitions, /Production p:prod(_,_,_) <- gr.definitions[s]}));

Tree randomTree(\char-class(list[CharRange] ranges), int rec, map[Symbol, set[Production]] _)
  = randomChar(ranges[arbInt(size(ranges))]);

default Tree randomTree(Symbol sort, int rec, map[Symbol, set[Production]] gr) {
   p = randomAlt(sort, gr[sort], rec);  
   return appl(p, [randomTree(delabel(s), rec + 1, gr) | s <- p.symbols]);
}

default Production randomAlt(Symbol sort, set[Production] alts, int rec) {
  int w(Production p) = rec > 100 ?  p.weight * p.weight : p.weight;
  int total(set[Production] ps) = (1 | it + w(p) | Production p <- ps);
  
  r = arbInt(total(alts));
  
  count = 0;
  for (Production p <- alts) {
    count += w(p);
    if (count >= r) {
      return p;
    }
  } 
  
  throw "could not select a production for <sort> from <alts>";
}

It is a simple recursive function which randomly selects productions from a reified grammar.

The trick towards termination lies in the weight of each rule. The weights are computed a priori, so that every rule has its own weight in the random selection. We take care to give the set of rules that lead to termination at least a 50% chance of being selected, as opposed to the recursive rules (code here: https://github.com/cwi-swat/drambiguity/blob/master/src/Termination.rsc):

Grammar terminationWeights(Grammar g) { 
   deps = dependencies(g.rules);
   weights = ();
   recProds = {p | /p:prod(s,[*_,t,*_],_) := g, <delabel(t), delabel(s)> in deps};
   
   for (nt <- g.rules) {
      prods       = {p | /p:prod(_,_,_) := g.rules[nt]};
      count       = size(prods);
      recCount    = size(prods & recProds);
      notRecCount = size(prods - recProds);
      
      // at least 50% of the weight should go to non-recursive rules if they exist
      notRecWeight = notRecCount != 0 ? (count * 10) / (2 * notRecCount) : 0;
      recWeight = recCount != 0 ? (count * 10) / (2 * recCount) : 0;
      
      weights += (p : p in recProds ? recWeight : notRecWeight | p <- prods); 
   }
       
   return visit (g) { 
       case p:prod(_, _, _) => p[weight=weights[p]]
   }
}

@memo 
rel[Symbol,Symbol] dependencies(map[Symbol, Production] gr) 
  = {<delabel(from),delabel(to)> | /prod(Symbol from,[_*,Symbol to,_*],_) := gr}+;
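
As a worked example of the weight formula in terminationWeights, take the Exp grammar from the question. This assumes the four alternatives are classified as one non-recursive rule (IntegerLiteral) and three recursive rules (the bracket, "*" and "+" rules), and that the divisions are integer divisions, as they are for Rascal int values:

```latex
\begin{align*}
\text{count} &= 4, \quad \text{notRecCount} = 1, \quad \text{recCount} = 3 \\
\text{notRecWeight} &= \left\lfloor \frac{4 \cdot 10}{2 \cdot 1} \right\rfloor = 20 \\
\text{recWeight} &= \left\lfloor \frac{4 \cdot 10}{2 \cdot 3} \right\rfloor = 6 \\
\text{non-recursive share} &= \frac{20}{20 + 3 \cdot 6} = \frac{20}{38} \approx 0.53
\end{align*}
```

So the single terminating rule carries a little over half of the total weight. Once the recursion depth exceeds 100, randomAlt squares the weights, and the same share becomes 400 / (400 + 3 * 36) = 400 / 508, or roughly 0.79, for this grammar.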

Note that this randomTree algorithm will not terminate on grammars that are not "productive" (i.e. grammars that have only a rule like syntax E = E;).

Also, it can generate trees that would be filtered by disambiguation rules. You can check for this by running the parser on a generated string and checking for parse errors. It can also generate ambiguous strings.
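
The parser check described above could look roughly like the sketch below. The module name CheckGenerated and the helper parsesCleanly are hypothetical names chosen for illustration; the grammar is the one from the question, repeated so the module stands alone. It relies on parse throwing ParseError on invalid input and Ambiguity on ambiguous input, which is its behaviour when ambiguity is not explicitly allowed:

```rascal
module CheckGenerated

import ParseTree;
import Exception;

// Grammar copied from the question so this module is self-contained.
lexical IntegerLiteral = [0-9]+;

start syntax Exp
  = IntegerLiteral
  | bracket "(" Exp ")"
  > left Exp "*" Exp
  > left Exp "+" Exp
  ;

// Hypothetical helper: true if `input` parses without a parse error
// and without ambiguity for the given nonterminal.
bool parsesCleanly(type[&T <: Tree] nonterminal, str input) {
  try {
    parse(nonterminal, input);
    return true;
  }
  catch ParseError(loc _): return false;
  catch Ambiguity(loc _, str _, str _): return false;
}
```

With the generator from this answer in scope, a generated sentence could then be filtered with something like parsesCleanly(#Exp, "<randomTree(#Exp)>"), retrying until a string passes.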

By the way, this code was inspired by the PhD thesis of Naveneetha Vasudevan of King's College, London.

Thank you very much, it is just what I was looking for. On that note, is the code of drambiguity released under an open-source license? — Jonata Pastro, Apr 26 '21 at 19:08
I haven't chosen a license yet but I imagine it would be bsd-2-ish; go ahead and use it but "caveat emptor", bugs may be included — Jurgen Vinju, Apr 30 '21 at 20:49
