Bison: distinguish variables by type / YYERROR in GLR-parsers

Question

I'm working on a parser, using GNU bison, and I'm facing an interesting problem. My actual problem is a bit different and less interesting in general, so I will state it a bit differently, so that an answer would be more useful in general.

I need to distinguish expressions depending on their type, like arithmetic expressions from string expressions. Their top-level nonterminals have some common ancestors, like

statement
    : TOK_PRINT expression TOK_SEMICOLON {...}
;
expression
    : arithmeticexpression {...}
    | stringexpression {...}
;

Now I need to be able to have variables in both kinds of expressions

arithmeticexpression
    : TOK_IDENTIFIER {...}
;
stringexpression
    : TOK_IDENTIFIER {...}
;

(only the variables of string-type should be allowed in the stringexpression case and only variables of int- or float-type should be allowed in the arithmeticexpression case) But this obviously causes an R/R conflict, which is inherent in the language - it is impossible to resolve, because the language is ambiguous.

Of course I could water down the language, so that only general "Expression" objects are passed around on the parse-stack, but then I'd have to do a lot of manual type-checking in the actions, which I would like to avoid. Also, in my real use-case, the paths through the grammar to the variable-rules are so different, that I would have to water the language down so much that I would lose a lot of grammar-rules (i.e. lose a lot of structural information) and would need to hand-write parsing-engines into some actions.

I have read about GLR-parsing, which sounds like it could solve my problem. I'm considering to use this feature: have the grammar as above and YYERROR in the paths where the corresponding variable has the wrong type.

arithmeticexpression
    : TOK_IDENTIFIER {
          if(!dynamic_cast<IntVariable*>(
              symbol_table[*$<stringvalue>1]))
              YYERROR;
      }
;
stringexpression
    : TOK_IDENTIFIER {
          if(!dynamic_cast<StringVariable*>(
              symbol_table[*$<stringvalue>1]))
              YYERROR;
      }
;

but the Bison Manual says

During deterministic GLR operation, the effect of YYERROR is the same as its effect in a deterministic parser.

The effect in a deferred action is similar, but the precise point of the error is undefined; instead, the parser reverts to deterministic operation, selecting an unspecified stack on which to continue with a syntax error.

In a semantic predicate (see Semantic Predicates) during nondeterministic parsing, YYERROR silently prunes the parse that invoked the test.

I'm not sure I understand this correctly - I understand it this way:

doesn't apply here, because GLR-parsing isn't deterministic
is how I have it in the code above, but shouldn't be done this way, because it is completely random which path is killed by the YYERROR (correct me if I'm wrong)
wouldn't solve my problem, because semantic predicates (not semantic actions!) must be at the beginning of a rule (correct me if I'm wrong), at which point the yylval from the TOK_IDENTIFIER token isn't accessible (correct me if I'm wrong), so I can't look into the symbol table to find the type of the variable.

Does anyone have experience with such a situation? Is my understanding of the manual wrong? How would you approach this problem? To me, the problem seems to be so natural that I would assume people run into it often enough that bison would have a solution built-in...

rici · Answer 1 · 2018-06-14T16:27:34.627

Note: Since at least bison 3.0.1 and as of bison 3.0.5, there is a bug in the handling of semantic predicate actions which causes bison to output a #line directive in the middle of a line, causing the compile to fail. A simple fix is described in this answer, and has been committed to the maint branch of the bison repository. (Building bison from the repository is not for the faint-hearted. But it would suffice to download data/c.m4 from that repository if you didn't want to just edit the file yourself.)

Let's start with the concrete question about YYERROR in a GLR parser.

In effect, you can use YYERROR in a semantic predicate in order to disambiguate a parse by rejecting the production it is part of. You cannot do this in a semantic action because semantic actions are not executed until the parser has decided on a unique parse, at which point there is no longer an ambiguity.

Semantic actions can occur anywhere in a rule, and will be immediately executed when they are reduced, even if the reduction is ambiguous. Because they are executed out of sequence, they cannot refer to the value produced by any preceding semantic action (and semantic predicates themselves have no semantic values even though they occupy a stack slot). Also, because they are effectively executed speculatively as part of a production which might not be part of the final parse, they should not mutate the parser state.

The semantic predicate does have access to the lookahead token, if there is one, through the variable yychar. If yychar is neither YYEMPTY nor YYEOF, the associated semantic value and location information are in yylval and yylloc, respectively. (see Lookahead Tokens). Care must be taken before using this feature, though; if there is no ambiguity at that point in the grammar, it is likely that the bison parser will have executed the semantic predicate without reading the lookahead token.

So your understanding was close, but not quite correct:

For efficiency, the GLR parser recognizes when there is no ambiguity in the parse, and uses the normal straightforward shift-reduce algorithm at those points. In most grammars, ambiguities are rare and are resolved rapidly, so this optimisation means that the overhead associated with maintaining multiple alternatives can be avoided for the majority of the parse. In this mode (which won't apply if ambiguity is possible, of course), YYERROR leads to standard error recovery, as in a shift-reduce parser, at the point of reduction of the action.
Since deferred actions are not run until ambiguity has been resolved, YYERROR in a deferred action will effectively be executed as though it were in a deterministic state, as in the previous paragraph. This is true even if there is a custom merge procedure; if a custom merge is in progress, all the alternatives participating in the custom merge will be discarded. Afterwards, the normal error recovery algorithm will continue. I don't think that the error recovery point is "random", but predicting where it will occur is not easy.
As indicated above, semantic predicates can appear anywhere in a rule, and have access to the lookahead token if there is one. (The Bison manual is a bit misleading here: a semantic predicate cannot contain an invocation of YYERROR because YYERROR is a statement, and the contents of the semantic predicate block is an expression. If the value of the semantic predicate is false, then the parser performs the YYERROR action.)

In effect, that means that you could use semantic predicates instead of (or as well as) actions, but basically in the way that you propose:

arithmeticexpression
    : TOK_IDENTIFIER %?{
          dynamic_cast<IntVariable*>(symbol_table[*$1])
      }
stringexpression
    : TOK_IDENTIFIER %?{
          dynamic_cast<StringVariable*>(symbol_table[*$1])
      }

Note: I removed the type declarations from $<stringvalue>1 because:

It's basically unreadable, at least to me,
More importantly, like an explicit cast, it is not typesafe.

You should declare the semantic type of your tokens:

%type <stringvalue> ID

which will allow bison to type-check.

Whether or not using semantic predicates for that purpose is a good idea is another question.

Thank you very much for this great answer, very helpful! I had tested the way you proposed and it had failed, but for a rather stupid reason - I guess my bison version (3.0.4) has a bug - it generates the code like this: if (! (#line 51 "mfcalc.y" /* glr.c:816 */ // the predicate )) YYERROR; and g++ doesn't like the #line behind the if(. When I manually edit the .cpp file and put that #line in the next line, it works. — Algoman, Jun 14 '18 at 08:47
After reading into merge-procedures, I think it would also make sense to allow both kinds, put a nullptr on the stack in all invalid cases and then have a merge-procedure that selects the (single) stack-element that is != nullptr. I would still have to water it down to use Expressions, but with the merges, I could at least avoid having hand-written parsers in the actions... — Algoman, Jun 14 '18 at 09:06
On the other hand, I am unsure whether the merge could happen too early, resolving something in one way, leaving an invalid rest. So I think doing it the way, we have it here now, is probably the best... I only added this step to my makefile: sed -i.bak s/" if (! (#line"/" if (! (\n#line"/g /path/to/parser.cpp — Algoman, Jun 14 '18 at 10:24

score 0 · Accepted Answer · answered Aug 21 '18 at 08:41

The example here works, but when I put it together with the solution for Bison: GLR-parsing of valid expression fails without error message I ran into the following problem:

The variable could be determined by more than one identifier. You could have an array-index or it could be a member of a different object. I modeled this situation by having another non-terminal like

lvalue
    : TOK_IDENTIFIER
    | lvalue '[' arithmeticexpression ']'
    | lvalue '.' TOK_IDENTIFIER

But when I now have

arithmeticexpression : lvalue
stringexpression : lvalue

and (try to) access the object from the non-terminal "lvalue" like above, I get a segmentation fault. So it seems that the method here doesn't work for more complex situations.

What I did instead now is that I access the object in the semantic ACTION and set $$ to nullptr if the variable has the wrong type.

arithmeticexpression
    : lvalue
    {
        $$ = nullptr;
        if(dynamic_cast<IntVariable*>($lvalue->getVariable()))
            $$ = new ArithmeticExpressionIntVariable($lvalue);
    }

Then I had to propagate the nullptr (this is quite a lot of additional code) like this

arithmeticexpression
    ...
    | arithmeticexpression[left] '+' arithmeticexpressionM[right]
    {
        $$ = nullptr;
        if($left && $right)
            $$ = new ArithmeticExpressionPlus($left, $right);
    }

So now, at the

expression
    : arithmeticexpression
    | stringexpression
(note: I have "Expression* expr;" in the "%union{" declaration
and "%type <expression> expr" in the prologue)

I have an ambiguity: the same input text can be parsed in two different ways, but only one of them will have a value != nullptr. At this point, I need a custom merge procedure which basically just selects the non-null value.

For this, I forward-declared it in the prologue of my bison file like this

static Expression* exprMerge (yy::semantic_type x0, yy::semantic_type x1);

and defined it in the epilogue like this

static Expression* exprMerge (YYSTYPE x0, YYSTYPE x1) \
{   
    /* otherwise we'd have a legitimate ambiguity */
    assert(!x0.expr || !x1.expr);
    return x0.expr ? x0.expr : x1.expr;
}

finally, I had to tell bison to use this procedure to resolve the ambiguity, using

expression
    : arithmeticexpression  %merge <exprMerge>
    | stringexpression %merge <exprMerge>

Honestly, I think this is quite a lot of effort, which wouldn't be necessary if bison tested semantic predicates (at least the ones which are behind the rule) when it tries to merge the paths, not to rule out paths before they merge.

But at least this works and is far less effort than collecting the tokens and sorting them manually, which would have been the alternative.

Note: from the lexer, I pass the TOK_IDENTIFIER via a copy on the heap. So far, I had deleted it after copying its content into the LValueIdentifier object in the lvalue : TOK_IDENTIFIER action. This caused another segmentation fault, because the action had to be executed on multiple stacks - I had to leave the string on the heap (producing a memory leak) — Algoman, Aug 21 '18 at 08:47
propagating nullptr's to a merge procedure seems like a better solution than semantic predicates. It certainly doesn't have to be much code; a little template metaprogramming could capture that pattern. — rici, Aug 21 '18 at 18:21
for the string handling, my goto solution is to intern strings in a memo table in the lexer, and then just pass pointers / references without copying. (These strings must be immutable, of course. For me, that's just fine.) The entire memo table needs to live until the parse is done, but it's really not very big; the fact that I never copy strings saves more memory than the occasional unnecessarily retained string. At least, in my apps it works nicely. It also makes string comparison and hashing faster. — rici, Aug 21 '18 at 18:28
I did it with preprocessor macros, but templates would probably be a cleaner solution, yes. — Algoman, Aug 23 '18 at 08:30

Bison: distinguish variables by type / YYERROR in GLR-parsers

2 Answers2

Linked