"end"  { return 'END'; }
...
0[xX][0-9a-fA-F]+ { return 'NUMBER'; }
[A-Za-z_$][A-Za-z0-9_$]* { return 'IDENT'; }
...
Call
  : IDENT ArgumentList
    {{ $$ = ['CallExpr', $1, $2]; }}
  | IDENT
    {{ $$ = ['CallExprNoArgs', $1]; }}
  ;

CallArray
  : CallElement
    {{ $$  = ['CallArray', $1]; }}
  ;

CallElement
  : CallElement "." Call
     {{ $$ = ['CallElement', $1, $3]; }}
  | Call
  ;

Hello! So, in my grammar I want "res.end();" to not detect end as a keyword, but as an ident. I've been thinking for a while about this one but couldn't solve it. Does anyone have any ideas? Thank you!

edit: It's a C-like programming language.

2 Answers

There's not quite enough information in the question to justify the assumptions I'm making here, so this answer may be inexact.

Let's suppose we have a somewhat Lua-like language in which a.b is syntactic sugar for a["b"]. Furthermore, since the . must be followed by a lexical identifier -- in other words, it is never followed by a syntactic keyword -- we'd like to inhibit keyword recognition in this context.

That's a pretty simple rule. It's simple enough that the lexer could implement it without any semantic information at all; all that it says is that the token which follows a . must be an identifier. In this context, keywords should be treated as identifiers, and anything else other than an identifier is an error.

We can do this with start conditions. Specifically, we define a start condition which is only used after a . token:

%x selector

%%
/* White space and comment rules need to explicitly include
 * the selector condition
 */
<INITIAL,selector>\s+   ;

/* Other rules, including keywords, are unmodified */
"end"                   return "END";

/* The dot rule triggers a new start condition */
"."                     this.begin("selector"); return ".";

/* Outside of the start condition, identifiers don't change state. */
/* Jison hands yytext to the parser as the token's value, so no assignment is needed. */
[A-Za-z_]\w*            return "ID";
/* Only identifiers are valid in this start condition, and if found
 * the start condition is changed back. Anything else is an error.
 */
<selector>[A-Za-z_]\w*  this.popState(); return "ID";
<selector>.             throw new Error("Expecting identifier");
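
To see the rule in isolation, here is a minimal hand-written sketch in plain JavaScript rather than Jison. The names `tokenize`, `KEYWORDS` and `afterDot` are made up for illustration; the `afterDot` flag plays the role of the `selector` start condition, and every punctuation character is simply returned as its own token.

```javascript
// Hypothetical stand-alone tokenizer, not generated by Jison.
const KEYWORDS = new Set(["end"]);

function tokenize(input) {
  const tokens = [];
  const ident = /[A-Za-z_]\w*/y; // sticky: only matches at lastIndex
  const space = /\s+/y;
  let pos = 0;
  let afterDot = false;          // stands in for the "selector" start condition

  while (pos < input.length) {
    // White space is skipped in both states and does not reset the flag.
    space.lastIndex = pos;
    if (space.test(input)) { pos = space.lastIndex; continue; }

    ident.lastIndex = pos;
    const m = ident.exec(input);
    if (m) {
      // Keywords are only recognised when we are not right after a dot.
      const isKeyword = !afterDot && KEYWORDS.has(m[0]);
      tokens.push(isKeyword ? { type: m[0].toUpperCase() }
                            : { type: "ID", text: m[0] });
      pos = ident.lastIndex;
      afterDot = false;
      continue;
    }

    // After a dot, anything other than an identifier is an error.
    if (afterDot) throw new Error("Expecting identifier at offset " + pos);

    const ch = input[pos++];
    tokens.push({ type: ch });
    afterDot = ch === ".";
  }
  if (afterDot) throw new Error("Expecting identifier at end of input");
  return tokens;
}

console.log(tokenize("res.end();").map(t => t.type).join(" "));
```

With this sketch, the `end` in `res.end();` comes back as an `ID` token whose text is "end", while a bare `end` is still the `END` keyword.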
rici

Modify your parser so that it always knows what it is expecting to read next. That will be some set of tokens, which you can compute using the notion of First(x), where x is any nonterminal.

When lexing, have the lexer ask the parser what set of tokens it expects next. Your keyword recognizer for 'end' asks the parser, and it either says "expecting 'end'", at which point the lexer simply hands on the 'end' lexeme, or it says "expecting ID", at which point it hands the parser an ID with name text "end".
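
As a rough illustration of the lexer side only, here is a small JavaScript sketch. It assumes the parser can somehow report the set of token types it will accept next; the `expected` set, `classifyWord` and `RESERVED` are hypothetical names, not the API of Jison or any other generator.

```javascript
// Hypothetical helper: decide how to label a word, given what the parser expects.
const RESERVED = new Set(["end"]);

function classifyWord(text, expected) {
  const keywordType = text.toUpperCase();
  // Hand over the keyword only if the parser can accept it here.
  if (RESERVED.has(text) && expected.has(keywordType)) {
    return { type: keywordType };
  }
  // Otherwise fall back to a plain identifier carrying the same text.
  if (expected.has("ID")) {
    return { type: "ID", text };
  }
  throw new Error(`Unexpected "${text}"`);
}

// After a "." the parser would only expect an identifier:
console.log(classifyWord("end", new Set(["ID"])));
// At the start of a statement it might expect either:
console.log(classifyWord("end", new Set(["END", "ID"])));
```

Note that when both interpretations are acceptable this sketch prefers the keyword; that is a design choice, not something the scheme forces.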

This may or may not be convenient to get your parser to do. But you need something like this.

We use a GLR parser; our parser accepts multiple tokens in the same place. Our solution is to generate both the 'end' keyword and the identifier with text "end" and shove them both into the GLR parser. It can handle local ambiguity; the multiple parses caused by this proceed until the parser with the wrong assumption encounters a syntax error, and then it just vanishes, by fiat. The last standing parser is the one with the right set of assumptions. This scheme is somewhat like the first one, except that we hand the parser the choices and it decides, rather than making the lexer decide.
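
This is not a GLR parser, but the "parallel parses that die on a syntax error" behaviour can be imitated with a toy. In the JavaScript sketch below, a tiny hand-made transition table stands in for the parser; it only accepts `ID ("." ID)* "(" ")" ";"` or a lone `END`. The table and the names `alternatives` and `parseAll` are all invented for illustration.

```javascript
// Toy stand-in for a parser: states and the token types each one accepts.
const TRANSITIONS = {
  start:    { ID: "name", END: "done" },
  name:     { ".": "afterDot", "(": "open" },
  afterDot: { ID: "name" },
  open:     { ")": "close" },
  close:    { ";": "done" },
  done:     {},
};

// Lexer side: a reserved word is offered under both interpretations.
function alternatives(lexeme) {
  if (lexeme === "end") return ["END", "ID"];
  if (/^[A-Za-z_]\w*$/.test(lexeme)) return ["ID"];
  return [lexeme];
}

// Each live entry is one candidate parse: the state reached and the types chosen.
function parseAll(lexemes) {
  let live = [{ state: "start", chosen: [] }];
  for (const lexeme of lexemes) {
    const next = [];
    for (const p of live) {
      for (const type of alternatives(lexeme)) {
        const target = TRANSITIONS[p.state][type];
        // No transition means a syntax error: this candidate is dropped.
        if (target !== undefined) {
          next.push({ state: target, chosen: [...p.chosen, type] });
        }
      }
    }
    live = next;
  }
  return live.filter(p => p.state === "done").map(p => p.chosen);
}

console.log(parseAll(["res", ".", "end", "(", ")", ";"]));
console.log(parseAll(["end"]));
```

For `res . end ( ) ;` only the candidate that read `end` as `ID` survives, and for a lone `end` only the `END` candidate does.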

You might be able to send your parser a "two-interpretation" lexeme, e.g., a keyword-in-context lexeme, which in essence claims to be either a keyword or an identifier. With a single token of lookahead internally, the parser can likely decide easily and restamp the lexeme. This is not as general as the GLR solution, but it probably works in a lot of cases.

Ira Baxter