It's not impossible to solve this problem within a bison-generated parser, but it's not easy, and the amount of hackery required might detract from the readability and verifiability of the grammar.
To be clear, GLR parsers are not fallback parsers. The GLR algorithm explores all possible parses in parallel, and rejects invalid ones as it goes. (The bison implementation requires that the parse converge to a single possible parse; the original GLR algorithm produces forest of parse trees.) Also, the GLR algorithm does not contemplate multiple lexical analyses.
If you want to solve this problem in the context of the parser, you'll probably need to introduce special handling for whitespace, or at least for - which are not surrounded by whitespace. Otherwise, you will not be able to distinguish between a - b
(presumably always subtraction) and a-b
(which might be the variable a-b
if that variable were defined). Leaving aside that issue, you would be looking for something like this (but this won't work, as explained below):
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: var
| '(' expr ')'
var : ident { if (!isVariable($1)) { /* reject this production */ } }
ident : WORD
| ident '-' WORD { $$ = concatenate($1, "-", $3); }
This won't work because the action associated with var : ident
is not executed until after the parse has been disambiguated. So if the production is rejected, the parse fails, because the parser has already determined that the production is necessary. (Until the parser makes that determination, actions are deferred.)
Bison allows GLR grammars to use semantic predicates, which are executed immediately instead of being deferred. But that doesn't help, because semantic predicates cannot make use of computed semantic values (since the semantic value computations are still deferred when the semantic predicate is evaluated). You might think you could get around this by making the computation of the concatenated identifier (in the second ident
production) a semantic predicate, but then you run into another limitation: semantic predicates do not themselves have semantic values.
Probably there is a hack which will get around this problem, but that might leave you with a different problem. Suppose that a
, c
, a-b
and b-c
are defined variables. Then, what is the meaning of a-b-c
? Is it (a-b) - c
or a - (b-c)
or an error?
If you expect it to be an error, then there is no problem since the GLR parser will find both possible parses and bison-generated GLR parsers signal a syntax error if the parse is ambiguous. But then the question becomes: is a-b-c
only an error if it is ambiguous? Or is it an error because you cannot use a subtraction operator without surround whitespace if its arguments are hyphenated variables? (So that a-b-c
can only be resolved to (a - b) - c
or to (a-b-c)
, regardless of whether a-b
and b-c
exist?) To enforce the latter requirement, you'll need yet more complication.
If, on the other hand, your language is expected to model a "fallback" approach, then the result should be (a-b) - c
. But making that selection is not a simple merge procedure between two expr
reductions, because of the possibility of a higher precedence *
operator: d * a-b-c
either resolves to (d * a-b) - c
or (d * a) - b-c
; in those two cases, the parse trees are radically different.
An alternative solution is to put the disambiguation of hyphenated variables into the scanner, instead of the parser. This leads to a much simpler and somewhat clearer definition, but it leads to a different problem: how do you tell the scanner when you don't want the semantic disambiguation to happen? For example, you don't want the scanner to insist on breaking up a variable name into segments when you the name occurs in a declaration.
Even though the semantic tie-in with the scanner is a bit ugly, I'd go with that approach in this case. A rough outline of a solution is as follows:
First, the grammar. Here I've added a simple declaration syntax, which may or may not have any resemblance to the one in your grammar. See notes below.
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: VARIABLE
| '(' expr ')'
decl : { splitVariables(false); } "set" VARIABLE
{ splitVariables(true); } '=' expr ';'
{ addVariable($2); /* ... */ }
(See below for the semantics of splitVariables
.)
Now, the lexer. Again, it's important to know what the intended result for a-b-c
is; I'll outline two possible strategies. First, the fallback strategy, which can be implemented in flex:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] {
yymore();
if (isVariable(yytext))
candidate_len = yyleng;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]* { if (!isVariable(yytext))
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
}
That uses yymore
and yyless
to find the longest prefix sequence of hyphenated words which is a valid variable. (If there is no such prefix, it chooses the first word. An alternative would be to select the entire sequence if there is no such prefix.)
A similar alternative, which only allows the complete hyphenated sequence (in the case where that is a valid variable) or individual words. Again, we use yyless and yymore, but this time we don't bother checking intermediate prefixes and we use a second start condition for the case where we know we're not going to combine words:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>("-"[[:alpha:]][[:alnum:]]*)*[[:alpha:]][[:alnum:]]* {
if (isVariable(yytext)) {
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
} else {
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(NO_COMBINE);
return WORD;
}
}
<NO_COMBINE>[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<NO_COMBINE>"-" { return '-'; }
<NO_COMBINE>.|\n { yyless(0); /* rescan */
BEGIN(INITIAL);
}
Both of the above solutions use isVariable
to decide whether or not a hyphenated sequence is a valid variable. As mentioned earlier, there must be a way to turn off the check, for example in the case of a declaration. To accomplish this, we need to implement splitVariables(bool)
. The implementation is straightforward; it simply needs to set a flag visible to isVariable
. If the flag is set to true, then isVariable
always returns true without actually checking for the existence of the variable in the symbol table.
All of that assumes that the symbol table and the splitVariables flag are shared between the parser and the scanner. A naïve solution would make both of these variables globals; a cleaner solution would be to use a pure parser and lexer, and pass the symbol table structure (including the flag) from the main program into the parser, and from there (using %lex-param
) into the lexer.