Code substitution (a la #define). Lexer? or Parser?

Question

tl;dr: How do you emulate the equivalent of C's #define with jison without doing a pre-processing step?

I am working on a relatively simple grammar with a feature to assign an identifier to a chunk of code that can then be re-used later for brevity. Example:

# Valid grammar with various elements of different types
foo x.3 y.4 z.5

# Assign an id to a chunk of code. Everything after -> is assigned to id
fill_1-> bar a.1 b.2 c.3

# Use chunk of code later
# Should be equivalent to parsing: "baz d.4 bar a.1 b.2 c.3 e.5"
baz d.4 ["fill_1"] e.5

So far, I've got my parser setup to correctly identify assignment lines of code and store the part to the right of the '->' in a dictionary available to other parser actions. Code related to the define action provided below:

// Lexer
HSPC                      [ \t]
ID                        [a-zA-Z_][a-zA-Z0-9_]*
%%
{ID}{HSPC}*"->"       {
                        this.begin("FINISHLINE"); 
                        yytext = yytext.replace("->", "").trim();
                        return "DEFINE";
                      }

('"'{ID}'"')          { 
                        yytext = yytext.slice(1,-1);
                        return "QUOTED_ID"; 
                      }

<FINISHLINE>.*        {
                        this.begin("INITIAL");
                        yytext = yytext.trim();
                        return "REST_OF_LINE";
                      }

%%

// Parser
statement
  : "[" QUOTED_ID "]"
    { $$ = (defines[$2] ? defines[$2] : ""); }
  | DEFINE  REST_OF_LINE
    { 
      defines[$1] = $2;
    }
  ;

%% 
var defines = {};

How can I get jison to actually tokenize and parse that saved snippet of code? Do I need to take an AST approach? Is there a way to inject the code into the parser? Should this happen in the lexing stage or the parsing stage? Would love to hear multiple strategies one could take with short example snippets.

Thanks!

An aside: Your syntax using double-quote for macro calls is, well, confusing. In particular, it looks much like what normal people consider by convention to be a text string, and they will mistake it as such. There is little point in encouraging such mistakes; I'd choose a different syntax. You want to use the *principle of least surprise* when designing things for others to use. — Ira Baxter, Dec 30 '13 at 07:41
@IraBaxter I think you make a good point, and I will consider updating that syntax. — abelnation, Jan 01 '14 at 23:34

Ira Baxter · Answer 1 · 2013-12-30T14:37:55.200

If by "take an AST approach", you mean "build ASTs for the original unsubstituted program, and for the substititions, and splice them together", you're in for a hard time. There's no guarantee that your substituted string matches any valid nonterminal in your grammar, so it isn't easy to build a tree for it. Your main program before substitutions is also extremely unlikely to be parsable by your full grammar. [You can overcome these difficulties by building substring parsers and doing wizardry with tree fragment gluing, but would be a lot of work [we are doing something like this for a C preprocessor analyzer], and I doubt ANTLR would help you much].

The usual approach for this is to have the lexer keep a stack of partially read input streams, with the bottom stream being the main program, and nested streams corresponding to partially read macro invocations (you need more than one, if one macro can invoke another. Surely your language allows "fill2 -> x.1 [ fill1 ] y.3 "? This means the lexer must:

recognize the macro definitions, so it has access to the mapping between name and macro content
recognize macro invocations (without disturbing the lexing or parsing state; usually this means the macro call has to be recognized by ad hoc machinery in the lexer)
"push" an input stream for the invoked macro on the stream stack
"pop" the stream stack when end of stream is reached

You may someday decide that you need parameters on your macros. You can typically implement these as streams, too.

You could imagine lexing the tokens and storing a token stream rather than text as the macro body; then the macro call detection and the body insertion could occur after the lexer and before the parser. Since there is presumably an interface between the two of them, putting code in between to manage this seems like a practical way to go. A complication may occur if your language allows the same stream of characters to be interpreted differently in different places in the program; how will the macro capture know how to lex the macro content in this case?

I don't know enough (or even much) about ANTLR3 to tell you how to accomplish this in detail.

See my follow-up answer. I think you raise good points. My requirements are a bit relaxed, in that a defined chunk that's valid syntax will be valid anywhere you use it (as long as you are calling the macro in a valid spot). RE parameters on my macro, I actually implemented a "multiply/repeat" operator e.g. `["macro_1"]*3` which allows you to repeat a chunk of code multiple times. Super helpful. — abelnation, Jan 01 '14 at 23:50

score 1 · Answer 2 · answered Jan 02 '14 at 16:31

The usual way to implement a preprocessor like the C preprocessor is to have it process the token stream between the lexer and the parser. So the lexer recognizes tokens, turning the input stream of characters into a stream of tokens, then the preprocessor operates on that stream of tokens, transforming it into new stream of tokens, and then finally the parser parses that stream of tokens.

Now if you're using yacc/lex (or bison/flex), this is a little bit tricky as they're designed to communicate directly via the parser calling yylex, with nothing in between. With flex, you can use the YY_DECL macro to change the declaration of the yylex function it defines, and insert your own yylex function:

%{
#define YY_DECL static int token()
%}
%%
 ... lex rules
%%

int yylex() {
    int tok = token();
    if (tok == IDENTIFIER) {
        Macro *mac = find_macro(yylval.str);
        if (mac) {
            yypush_buffer_state(dummy_buffer);
            yy_scan_string(mac->definition);
            return yylex(); } }
    return tok;
}

With lex, you can use #define yylex token/#undef yylex in the appropriate spots to get the same effect.

score 0 · Answer 3 · answered Jan 01 '14 at 23:45

So after further reading, here's one potential approach that does not require pre-processing.

I am now generating a Abstract Syntax Tree in my parser for different types of nodes. My AST consists of various classes, some representing fundamental units and others representing meta-units. My "macro" are represented with their own AST element, which holds a reference to a collection of parsed AST elements. So, instead of doing "code substitution," where the code is substituted and then parsed multiple times, I am parsing the defined chunk once and storing a reference to the created AST elems. E.g.:

# fundamental units
class TabElement
# subclasses of TabElement
class NoteElement
class SlideElement
class MuteElement

# meta element: holds a collection of TabElements's
class NoteGroupElement
# meta element: holds a collection of TabElements's and otherstatements, has an ID
#               the collection is accessible via a global dict to other elements
class PredefElement
# meta element: represents a usage of a PredefElement
class PredefInvokeElement

Abbreviated lexer and parser rules for doing so look something like the following (lots of extraneous stuff omitted, hopefully you get the picture):

/* LEXER */
<INITIAL>[\s]             /* ignore whitespace */
{ID}{HSPC}*"->"       {
                        this.begin("DEFINE"); 
                        yytext = yytext.replace("->", "").trim();
                        return "DEFINE";
                      }
/* using a pre-def */
('"'{ID}'"')|("'"{ID}"'") { 
                        yytext = yytext.slice(1,-1);
                        return "QUOTED_ID"; 
                      }

/* When defining a code chunk.  Newlines delimit the end of the definition */
<DEFINE>{HSPC}            /* ignore horizontal whitespace */
<DEFINE>({NL}|<<EOF>>)    { this.begin("INITIAL"); return "NL"; }

/* PARSER */
statement
  : statement_group
    { $$ = $1; }
  | DEFINE statements NL
    { 
      defines[$1] = new ast.PredefinedElement($1, $2, @1);
      $$ = defines[$1]
    }
  | predefine_invoke
    { $$ = $1; }
  | chord
    { $1 = $1; }
  | bar_token
    { $$ = $1; }
  | REPEAT_MEASURE
    { $$ = new ast.RepeatMeasureElement(@1); }
  | REST
    { $$ = new ast.RestElement(@1); }
  | DURATION
    { $$ = new ast.DurationElement($1, @1); }
  | note_token
    { $$ = $1; }
  ;

predefine_invoke
  : "[" QUOTED_ID "]"
    %{ 
      if (defines[$2]) { $$ = new ast.PredefinedInvokeElement(defines[$2], @2); } 
      else { $$ = undefined; }
    %}
  | "[" QUOTED_ID "]" "*" INTEGER
    %{ 
      if (defines[$2]) { $$ = new ast.PredefinedInvokeElement(defines[$2], @2, { repeat: parseInt($5) }); } 
      else { $$ = undefined; }
    %}
  ;

Note that I use a different lexer mode to pay attention to Newlines when parsing the right-hand side of the DEFINE — abelnation, Jan 01 '14 at 23:46
How does your parser "parse" the un-preprocessed code? Assume my grammar consists of "G = I '+' N" where I is an identifer, and N is a number. Trying to use your notation, assume my concrete text is fill_1 -> + 3 \n x ["fill_1"] \n where \n is a newline character. I can see how your parser might pick up the macro definition. How does it parse the x [ "fill_1"] part? — Ira Baxter, Jan 02 '14 at 04:06

Code substitution (a la #define). Lexer? or Parser?

3 Answers3