I have a particular construction that I'm having trouble debugging. I hope this aggressive simplification demonstrates the issue properly:

%token VAR IDENTIFIER NUMBER INT

%%
root:
    code_statements
    ;

code_statements:
    code_statement
    | code_statements code_statement
    ;

code_statement:
    var_statement
    | expression_statement
    ;

var_decl:
    IDENTIFIER ':' type_expression
    ;
var_statement:
    VAR var_decl ';'
    ;

expression_statement:
    value_expression ';'
    ;

function_call:
    '(' ')'
    | '(' value_expression_list ')'
    ;

value_expression_list:
    value_expression
    | value_expression_list ',' value_expression
    ;

/**** UNKNOWN EXPRESSIONS (VALUE/TYPE?) ****/

unknown_expression:
    IDENTIFIER
    | unknown_expression '.' IDENTIFIER
    | postfix_known_value_expression '.' IDENTIFIER
    | postfix_known_type_expression '.' IDENTIFIER
    ;

/**** VALUE EXPRESSION ****/

primary_value_expression:
    NUMBER
    | '(' value_expression ')'
    ;

postfix_known_value_expression:
    primary_value_expression
    | postfix_value_expression function_call
    ;

postfix_value_expression:
    postfix_known_value_expression
    | unknown_expression
    ;

value_expression:
    postfix_value_expression
    ;


/**** TYPE EXPRESSION ****/

primary_type_expression:
    INT
    | '(' type_expression ')'
    ;

postfix_known_type_expression:
    primary_type_expression
    | postfix_type_expression '*'
    ;

postfix_type_expression:
    postfix_known_type_expression
    | unknown_expression
    ;

type_expression:
    postfix_type_expression
    ;

%%

Error output:

State 11

   14 unknown_expression: unknown_expression . '.' IDENTIFIER
   22 postfix_value_expression: unknown_expression .
   29 postfix_type_expression: unknown_expression .

    '.'  shift, and go to state 26

    ';'       reduce using rule 22 (postfix_value_expression)
    ';'       [reduce using rule 29 (postfix_type_expression)]
    ')'       reduce using rule 22 (postfix_value_expression)
    ')'       [reduce using rule 29 (postfix_type_expression)]
    '*'       reduce using rule 29 (postfix_type_expression)
    $default  reduce using rule 22 (postfix_value_expression)

There are two problems here. The one related to the ) is easy to understand, but I'm struggling to think of an elegant way to solve it. As for the conflict on the ;, I can't quite see how it arises (although I do know how to affect it), and I'm not sure how to tease the relevant information out of the error output to debug the actual problem.

In my actual grammar, there are a few more, but I think the patterns are basically the same as these 2.

Manu Evans
  • When I prefix what you provide with `%token INT IDENTIFIER NUMBER` and `%%`, I get an error that `type_expression` is an unused non-terminal. When I comment that out, the grammar compiles without any R/R conflict. That's a 'cannot reproduce'. Please go back to your drawing board, and re-create your MCVE so that the grammar you post does reproduce the problem. JFTR, I used Bison 2.3 (2006) on macOS Sierra 10.12.1. I'm not sure whether other versions produce different errors. – Jonathan Leffler Dec 12 '16 at 06:56
  • Crap >_< .. it's really big and hard to reduce without heaps of cruft. I really should have tried to run bison on what I posted though huh... I'll update with the actual problem... – Manu Evans Dec 12 '16 at 09:21
  • Did edit, sorry for the noise. – Manu Evans Dec 12 '16 at 10:31

1 Answer

The basic issue here is the pair of productions shown in State 11: (Note: I used the command-line option --report=all instead of -v in order to see the lookahead computation.)

22 postfix_value_expression: unknown_expression .  [';', '(', ')', ',']
29 postfix_type_expression: unknown_expression .  [';', ')', '*']

Here you can see that the computed lookahead sets for both postfix_value_expression and postfix_type_expression include ; and ).

The first of these is (somewhat) spurious; it is one of the rare cases in which the LALR algorithm conflates two states which actually have different lookaheads. (This issue is briefly discussed in the bison manual and at more length in most standard parsing textbooks.) If we ask bison to use the Canonical LR algorithm (%define lr.type canonical-lr), this conflict disappears, but the second one remains (in state 52):

22 postfix_value_expression: unknown_expression .  ['(', ')']
29 postfix_type_expression: unknown_expression .  [')', '*']

What we have here is a true ambiguity in your grammar. (I'm assuming that "types" can have "value" members (and vice-versa), as the grammar seems to indicate.)

Suppose the input includes

x.member

The parser has no need to know whether x is a type or a value. x will simply be parsed as an unknown_expression (production 13) and the member selection will also be an unknown_expression (production 14). (The production numbers come from bison's output file; the numbered grammar is reproduced in Note 3 at the end of this answer.)

But since both values and types can be parenthesized (productions 18 and 25), the above could legitimately be written:

(x).member

Now the parser needs to apply either production 15 or 16 in order to derive unknown_expression, which means that it needs to reduce x using production 22 (to proceed with production 15) or production 29 (for production 16). It does not, however, know whether x is a type or a value, and it cannot know (at least, without consulting the symbol table). Although the two possible reductions are (presumably) equivalent, they are different productions, so the grammar is ambiguous. Hence the reduce/reduce conflict.

Looked at this way, the conflict is the result of what we might call premature categorization. Since the grammar doesn't actually care whether the left-hand operand of the index operator . is a type or a value, the left-hand operand does not need to be resolved at this point. That will be worked out by semantic analysis in some subsequent step. That being the case, the simplest solution is to not bother with distinguishing between types and values in the grammar, which will work out just fine for this simple language subset. A very simple grammar would suffice:

expression        : postfix_expression
postfix_expression: term
                  | postfix_expression '.' IDENTIFIER
                  | postfix_expression '(' expression_list ')'
                  | postfix_expression '*'
term              : IDENTIFIER
                  | '(' expression ')'

and any necessary semantic checks (to ensure that the expression is the right kind) could be done in a posterior tree walk, once all the identifiers have been categorized.
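A minimal Python sketch of what such a posterior tree walk might look like. Everything here is hypothetical and not part of the grammar above: the tuple-shaped AST nodes, the `symbols` table and the `category` function are assumptions made for illustration, and member lookup is deliberately simplified.

```python
# Hypothetical AST node shapes (parentheses leave no node in the tree):
#   ("id", name)  ("num", text)  ("select", operand, member)
#   ("call", callee, args)       ("pointer", operand)
# `symbols` maps names to "type" or "value"; it is an assumed stand-in for a
# real symbol table that is filled in after parsing.

def category(node, symbols):
    """Return "type" or "value" for node, raising on a category mismatch."""
    kind = node[0]
    if kind == "id":
        return symbols[node[1]]
    if kind == "num":
        return "value"
    if kind == "pointer":            # postfix '*': operand must be a type
        if category(node[1], symbols) != "type":
            raise TypeError("'*' applied to a non-type")
        return "type"
    if kind == "call":               # callee and arguments must be values
        for child in (node[1], *node[2]):
            if category(child, symbols) != "value":
                raise TypeError("call involves a non-value")
        return "value"
    if kind == "select":
        # Simplifying assumption: the member's category is looked up by name
        # alone; a real compiler would search the operand's scope.
        category(node[1], symbols)   # still check the operand subtree
        return symbols[node[2]]
    raise ValueError(f"unknown node kind: {kind}")


symbols = {"x": "type", "member": "value", "f": "value"}
print(category(("select", ("id", "x"), "member"), symbols))   # value
print(category(("pointer", ("id", "x")), symbols))            # type
```

The point of the sketch is only that the parser never needs the category: the walk computes it bottom-up and reports a mismatch wherever an operator receives the wrong kind of operand.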

It is useful to ask whether it is ever necessary to grammatically resolve the category of an expression. Is it actually required to disambiguate the parse?

In C (and friends), the language really is ambiguous unless identifiers are categorized. For example, (x)*y is either:

  • a cast of the dereferenced pointer y, or
  • the product of the scalars x and y

A correct parse tree cannot be produced without knowing the category of x. (To make this even clearer, consider z/(x)*y, where the shape of the parse tree is radically different between the two alternatives. There are many other examples.)

The usual solution for producing an accurate parse tree for C is to resort to lexical feedback, which in turn requires declaration before use, at least to the extent of categorization. (Even C++ insists that type aliases be declared before use, even though it allows class members to be used before declaration.)
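To make lexical feedback concrete, here is a small Python sketch under stated assumptions: the tokenizer is handed a set of already-declared type names (a stand-in for the symbol table) and labels each identifier accordingly, so the parser sees different token kinds for the same text. The names `tokenize`, `parse_paren_star`, `TYPE_NAME` and `IDENTIFIER` are invented for this illustration, and the "parser" only recognises the single shape (x)*y.

```python
import re

def tokenize(source, type_names):
    """Split source into (kind, text) pairs, consulting the declared type
    names so that identifiers arrive as TYPE_NAME or IDENTIFIER."""
    tokens = []
    for text in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if text[0].isalpha() or text[0] == "_":
            kind = "TYPE_NAME" if text in type_names else "IDENTIFIER"
        elif text.isdigit():
            kind = "NUMBER"
        else:
            kind = text              # punctuation stands for itself
        tokens.append((kind, text))
    return tokens


def parse_paren_star(tokens):
    """Interpret exactly the five-token shape '(' name ')' '*' name."""
    kinds = [kind for kind, _ in tokens]
    if kinds == ["(", "TYPE_NAME", ")", "*", "IDENTIFIER"]:
        return ("cast", tokens[1][1], ("deref", tokens[4][1]))
    if kinds == ["(", "IDENTIFIER", ")", "*", "IDENTIFIER"]:
        return ("mul", tokens[1][1], tokens[4][1])
    raise SyntaxError("not of the form (x)*y")


print(parse_paren_star(tokenize("(x)*y", type_names={"x"})))   # ('cast', 'x', ('deref', 'y'))
print(parse_paren_star(tokenize("(x)*y", type_names=set())))   # ('mul', 'x', 'y')
```

The same five characters yield two different trees, and the only thing that changed is what the tokenizer knew about x when it reached it, which is why this approach forces declaration before use.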

A reasonable alternative is to use something like a GLR parser, which produces both (or all) parse trees. (The result is usually called a parse forest.) A subsequent semantic analysis pass could prune the parse forest once identifiers have been categorized, although it is still necessary that the categorization is unambiguous for all parse trees in the forest. [Note 1]

Avoiding the above ambiguity is mostly a question of language design; had the syntax for C casts used square brackets instead of parentheses, for example, the problem could have been eliminated. (Of course, that would have used up the syntactic space which C++ used for lambdas.) Nonetheless, it is easy to imagine a language whose grammar is unambiguous even without knowing the categories of each identifier, but which nonetheless requires consideration of expression domains because of syntactic divergence. It seems likely that your language will be of this form. (In some sense, this is top-down categorization rather than bottom-up categorization: a declaration and an expression statement have different parsing contexts.)

For example, it is easy to imagine that you anticipate using * in a similar manner to the C family, as both a prefix dereference operator and infix multiplication operator. In combination with the postfix type construction operator shown above, that ends up with a single operator being used for prefix, postfix and infix. If all the uses are tossed indiscriminately into the same bag, that is necessarily ambiguous: a single token can represent at most two of operand, infix operator, prefix operator and postfix operator. If * could be any of the three operator types, then a * * b could mean either (a*) * b or a * (*b).

But that expression is still possible to parse unambiguously even without knowing the category of identifiers, because we know that neither the prefix nor the infix * operators can be applied to the result of the postfix * type constructor. Consequently, (a*) * b is impossible, and the only valid parse is a * (*b).

Unfortunately, since an LR(1) parser needs to decide whether or not to reduce on the basis of a single lookahead token, the decision as to whether a* should be reduced to a type or not needs to be made when the second * is encountered, which is not possible because (a * *) is a valid type expression. And worse, there is no limit to the number of asterisks in the expression, which might actually be a*…*[b]. We can disambiguate based on the token following the last asterisk, but that requires arbitrary lookahead.
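The argument of the last few paragraphs can be checked by brute force. The following Python sketch is only an illustration (the function names and the P/I/D encoding are invented here): it enumerates every way to read a run of asterisks after an operand a, then applies the rule that the infix operator cannot take the result of the postfix type constructor as its left operand.

```python
def structural_readings(n_stars):
    """All ways to split a run of asterisks between operands `a` and `b`:
    some postfix ('P'), exactly one infix ('I'), the rest prefix ('D')."""
    return ["P" * i + "I" + "D" * (n_stars - i - 1) for i in range(n_stars)]


def valid_readings(n_stars, operand_follows):
    """Readings that survive the category rules discussed in the text."""
    if not operand_follows:
        # Nothing follows the run, as in `(a**)`: every asterisk is postfix.
        return ["P" * n_stars]
    # Infix '*' cannot be applied to the result of postfix '*' (a type),
    # so any reading that starts with a postfix asterisk is rejected.
    return [r for r in structural_readings(n_stars) if not r.startswith("P")]


print(structural_readings(2))                      # ['ID', 'PI']
print(valid_readings(2, operand_follows=True))     # ['ID']
print(valid_readings(3, operand_follows=True))     # ['IDD']
print(valid_readings(3, operand_follows=False))    # ['PPP']
```

For any run length there is exactly one surviving reading, so the language is unambiguous, but which reading survives depends on `operand_follows`, that is, on the token after the last asterisk. That is the arbitrary lookahead an LR(1) parser does not have.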

Given the above, the best solution is probably a GLR parser, which doesn't have a lookahead limitation, combined with enough syntax to at least divide the contexts in which operators like * could be used.

We can't just throw a GLR parser at the grammar as presented in the OP, because -- as mentioned above -- it is definitely ambiguous even though the ambiguity is unimportant. Fortunately, we can create an unambiguous grammar by avoiding premature categorization.

The problem with parenthesized expressions, as shown above, is that the grammar requires them to be either type or value expressions, even in cases where it doesn't care which. To avoid that, we need to allow the possibility of delaying the decision until and if it is necessary by creating a third type of parenthesized expression. That results in the following grammar, which is actually LALR(1).

I separated out the expression productions because otherwise the boilerplate is overwhelming. Each production effectively indicates the expected category of the arguments, the grammatical precedence of the operator, and the category of the result (or unknown). The left-hand side of each production is known_category_precedence, asserting that the grammar requires that category, or unknown_precedence, indicating that the grammar by itself does not predict a category. (For example, '(expr).b' might be a type or a value, so the select operator's result is unknown_postfix.)

On the right-hand side, you will see non-terminals of the form category_precedence, which assert that the semantic analysis must result in an object with that category. These non-terminals are really just for convenience; each one is defined simply as

category_precedence: known_category_precedence | unknown_precedence

in which the second alternative indicates the need for a semantic check. If that were possible during the parse, it would be inserted in the action for the second alternative. Alternatively, a semantic check node could be inserted into the AST.
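As a rough Python sketch of the "semantic check node" idea: each subtree carries the category the grammar assigned to it, and coercing an unknown subtree where a value is required wraps it in a check node. This is a guess at one possible shape, not the code behind the sample runs later in this answer; the helper names and the (category, tree) pairs are assumptions, and only the bracketed output format is borrowed from those runs.

```python
# Each parse result is a (category, tree) pair, where category is
# "value", "type" or "unknown" and tree is a string or a nested list.

def as_value(node):
    """Model `value: known_value | unknown`: known values pass through,
    unknown subtrees get an ASVALUE check node for later analysis."""
    category, tree = node
    if category == "value":
        return node
    if category == "unknown":
        return ("value", ["ASVALUE", tree])
    raise TypeError("a type cannot be used where a value is required")

def identifier(name):
    return ("unknown", name)

def deref(operand):
    return ("value", ["DEREF", as_value(operand)[1]])

def product(left, right):
    return ("value", ["PROD", as_value(left)[1], as_value(right)[1]])

def select(operand, member):
    return ("unknown", ["SELECT", operand[1], member])   # any category allowed

def expr_statement(expr):
    return ["EXP", as_value(expr)[1]]

def show(tree):
    """Render a tree in bracketed form."""
    if isinstance(tree, str):
        return tree
    return "[" + " ".join(show(t) for t in tree) + "]"


print(show(expr_statement(product(identifier("a"), identifier("b")))))
# [EXP [PROD [ASVALUE a] [ASVALUE b]]]
print(show(expr_statement(select(deref(identifier("a")), "b"))))
# [EXP [ASVALUE [SELECT [DEREF [ASVALUE a]] b]]]
```

In a bison grammar the body of `as_value` would live in the action attached to the `unknown` alternative of the convenience non-terminal, while the `known_value` alternative would simply pass its semantic value through.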

I included these convenience productions in the boilerplate but most of them are commented out because bison complains when a non-terminal is not used anywhere in the grammar.

%token VAR "var"
%token IDENTIFIER TYPE NUMBER
%%
program            : %empty
                   | program statement
statement          : var_declaration
                   | expr_statement
var_declaration    : "var" IDENTIFIER ':' type ';'
expr_statement     : value ';'
 /* Productions for expression syntaxes */
parameters         : '(' ')'
                   | '(' value_list ')'
value_list         : value
                   | value_list ',' value
known_value_postfix: value_postfix parameters    { /* function call */ }
unknown_postfix    : any_postfix '.' IDENTIFIER  { /* select */ }
known_type_postfix : type_postfix '*'            { /* pointer type constructor */ }

 /* Primary terms */
unknown_primary    : IDENTIFIER
                   | '(' unknown ')'
known_value_primary: NUMBER
                   | '(' known_value ')'
known_type_primary : TYPE
                   | '(' known_type ')'
 /* Boilerplate precedence grammar with two infix precedence levels */
unknown_postfix    : unknown_primary
known_value_postfix: known_value_primary
known_type_postfix : known_type_primary
value_postfix      : known_value_postfix | unknown_postfix
type_postfix       : known_type_postfix  | unknown_postfix
any_postfix        : known_value_postfix | known_type_postfix | unknown_postfix

unknown_prefix     : unknown_postfix
known_value_prefix : known_value_postfix
known_type_prefix  : known_type_postfix
/* value_prefix    : known_value_prefix | unknown_prefix */
/* type_prefix     : known_type_prefix  | unknown_prefix */

unknown_infix9     : unknown_prefix
known_value_infix9 : known_value_prefix
known_type_infix9  : known_type_prefix
/* value_infix9    : known_value_infix9 | unknown_infix9 */
/* type_infix9     : known_type_infix9  | unknown_infix9 */

unknown_infix8     : unknown_infix9
known_value_infix8 : known_value_infix9
known_type_infix8  : known_type_infix9
/* value_infix8    : known_value_infix8 | unknown_infix8 */
/* type_infix8     : known_type_infix8  | unknown_infix8 */

/* The last stanza is mostly for convenience but it also serves
 * to avoid the spurious reduce/reduce conflict on ';'.
 */
unknown            : unknown_infix8
known_value        : known_value_infix8
known_type         : known_type_infix8
value              : known_value | unknown
type               : known_type | unknown

With that framework in place, we can start adding other types of expression productions.

First, let's confront the three-way conflict on *. The basic expression productions are quite simple, based on the pattern described above:

known_value_prefix : '*' value_prefix            { /* dereference */ }
known_value_infix9 : value_infix9 '*' value_prefix { /* product */ }

(We also need to uncomment the value_prefix and value_infix9 productions.)

Although the language is still unambiguous, it is no longer LALR(1) (or even LR(k) for any k), as indicated above in the discussion of this syntax. So bison will complain that there is a reduce-reduce conflict. We can't fix that complaint, but we can easily produce a working parser; all we need to do is insert a request that bison generate a GLR parser:

%glr-parser

That doesn't suppress the conflict warning [Note 2], but it does produce a working parser. (bison cannot verify that the grammar is unambiguous because there is no precise algorithm for doing so.) Since we cannot prove the grammar is unambiguous, we need to do extensive testing. If the GLR parser encounters an ambiguity, it will produce an error message, so we can be reassured when we don't see any error:

$ ./typevar3
(*a).b;
[EXP [ASVALUE [SELECT [DEREF [ASVALUE a]] b]]]
a*b;
[EXP [PROD [ASVALUE a] [ASVALUE b]]]
a**b;
[EXP [PROD [ASVALUE a] [DEREF [ASVALUE b]]]]
(a***b).c;
[EXP [ASVALUE [SELECT [PROD [ASVALUE a] [DEREF [DEREF [ASVALUE b]]]] c]]]
(a***).c;
[EXP [ASVALUE [SELECT [POINTER [POINTER [POINTER [ASTYPE a]]]] c]]]

Now, let's suppose we want to add an array type constructor, somewhat similar to C. We'll allow the form:

type[count]  // e.g. int[3]

Note that the form is syntactically similar to an array index operation (a[2]), which we'll also add to the language.

This is a slightly different case from the select operator. In the case of the select operator, the grammar permits either values or types to be selected from, and cannot predict the result. In the case of array constructing/indexing, the category of the result is precisely the category of the first argument.

Because we need to "pass through" the category of the first argument, we need three productions, rather like the parenthesis productions:

known_value_postfix: known_value_postfix '[' value ']'
known_type_postfix : known_type_postfix '[' value ']'
unknown_postfix    : unknown_postfix '[' value ']'

In the third case, we'll need to insert a "construct_or_index" node into the AST. That could be resolved into a construct node or an index node by a later unit production which converts an unknown category into some specific category, or it could be left for the semantic analysis phase.
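A small Python sketch of the second option, leaving the decision to a later pass. The node names follow the sample output shown further down (INDEX_OR_ARRAY, INDEX, ARRAY), but the `resolve` function itself and the list-based tree shape are assumptions made for illustration only.

```python
def resolve(tree, category):
    """Rewrite INDEX_OR_ARRAY nodes once the category of the whole
    expression ("value" or "type") has been established."""
    if category not in ("value", "type"):
        raise ValueError(f"unexpected category: {category}")
    if isinstance(tree, str):
        return tree
    if tree[0] == "INDEX_OR_ARRAY":
        head = "INDEX" if category == "value" else "ARRAY"
        # The result has the same category as the first operand, so the
        # category is passed down; the bracketed operand is always a value.
        return [head, resolve(tree[1], category), resolve(tree[2], "value")]
    return tree   # other node kinds are left untouched in this sketch


node = ["INDEX_OR_ARRAY", "a", ["VALUE", "2"]]
print(resolve(node, "value"))   # ['INDEX', 'a', ['VALUE', '2']]
print(resolve(node, "type"))    # ['ARRAY', 'a', ['VALUE', '2']]
```

Because the category passes straight through the first operand, a chain such as a[2][3] resolves consistently from the outside in: once the outermost node is known to be a type, every nested node becomes ARRAY as well.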

Adding these three productions produces no problems. However, now we would like to add a syntax for constructing variable-sized arrays, and we will choose the syntax int[*], generating yet another incompatible use of the * lexeme. That syntax has a definite result category, since it is not a valid index expression, so we can write the production:

known_type_postfix : type_postfix '[' '*' ']'

This time, we chose to use a convenience non-terminal on the right-hand side. Of course, that produces a raft of new grammatical conflicts, both shift-reduce and reduce-reduce, but the grammar continues to work as desired:

var a: (int*)[*];
[DECL a [ARRAY [POINTER [TYPE int]]]]
var a: (int*)[2*2];        
[DECL a [ARRAY [POINTER [TYPE int]] [PROD [VALUE 2] [VALUE 2]]]]
a[2];
[EXP [ASVALUE [INDEX_OR_ARRAY a [VALUE 2]]]]
(*a)[2]; 
[EXP [INDEX [DEREF [ASVALUE a]] [VALUE 2]]]

Left as an exercise: use the infix + operator to represent the scalar addition of two values, and the union of two types. Note that you will have to deal with the cases where just one of the two operands has a known category (and therefore the other operand must be coerced into the same category), as well as the case where neither operand has a known category.


Notes

  1. The GLR parser implemented by bison is not ideal for this purpose because it wants to produce a single parse tree. It is possible to manually create a parse forest with the bison-generated parser by using %merge declarations (as described in the manual), but this does not implement the space-efficient graph-representation which other GLR parsers can produce.

  2. The bison manual suggests using the %expect-rr directive to do that, but IMHO that should only be done once the grammar is production-ready. Unfortunately, you can only suppress a known count of conflicts, rather than suppressing particular expected conflicts; I suppose this is because it is difficult to precisely describe a single expected conflict, but the end result is that suppressing conflicts makes it easy to miss issues while you are developing the grammar. Verifying that the conflicts are the expected ones is annoying, but less annoying than missing an error in the grammar.

  3. The original grammar with numbered productions:

     1 root: code_statements
     2 code_statements: code_statement
     3                | code_statements code_statement
     4 code_statement: var_statement
     5               | expression_statement
     6 var_decl: IDENTIFIER ':' type_expression
     7 var_statement: VAR var_decl ';'
     8 expression_statement: value_expression ';'
     9 function_call: '(' ')'
    10              | '(' value_expression_list ')'
    11 value_expression_list: value_expression
    12                      | value_expression_list ',' value_expression
    13 unknown_expression: IDENTIFIER
    14                   | unknown_expression '.' IDENTIFIER
    15                   | postfix_known_value_expression '.' IDENTIFIER
    16                   | postfix_known_type_expression '.' IDENTIFIER
    17 primary_value_expression: NUMBER
    18                         | '(' value_expression ')'
    19 postfix_known_value_expression: primary_value_expression
    20                               | postfix_value_expression function_call
    21 postfix_value_expression: postfix_known_value_expression
    22                         | unknown_expression
    23 value_expression: postfix_value_expression
    24 primary_type_expression: INT
    25                        | '(' type_expression ')'
    26 postfix_known_type_expression: primary_type_expression
    27                              | postfix_type_expression '*'
    28 postfix_type_expression: postfix_known_type_expression
    29                        | unknown_expression
    30 type_expression: postfix_type_expression
    
rici
  • Thanks for this answer. The first issue where you suggest to use `%define lr.type canonical-lr` is the issue I'm really struggling with, and I tried to apply your solution to my actual grammar, but that changes the result from 5 reduce/reduce conflicts, to 8. They're a totally different set of conflicts though. More on this? The second issue where both value and type have a parenthesis form; I understand the problem, I'm looking for creative ways to allow that construct. Value/type separation grammatically diverge, but the common 'unknown' section can be expanded to include more expressions. – Manu Evans Dec 13 '16 at 00:55
  • Your `(x)*y` example is very interesting food-for-thought. I almost certainly will run into issues like that, although the grammar of the language itself is flexible, I can probably work around most cases like this... although I think this is going to be a recurring problem. Are there generalised solutions? It kinda feels like defeat to have the grammar require in-progress semantic to parse. Requires a non-linear parse; support forward-referencing etc. Or is that a reasonable/accepted solution? – Manu Evans Dec 13 '16 at 01:05
  • @ManuEvans: I added a grammar which handles your excerpt without conflicts (assuming I did the copy/paste correctly); I have some more operators to add to demonstrate some other features but once again I ran out of time... Getting there, though. – rici Dec 15 '16 at 00:22
  • Thanks so much for the detail in this answer, it will take me a while to digest. – Manu Evans Dec 15 '16 at 05:59
  • The grammatical solutions you presented here are actually exactly what I came up with last night while I was working on this following your initial reply prior to your edit. Keeping an unknown tree, and only applying value/type-ness at the time the expression is unambiguously a value or type. Can you recommend alternative GLR parser generators? I'm not married to bison, it just appeared to be the accepted standard. – Manu Evans Dec 15 '16 at 06:43
  • @manu: i use bison and it has always worked fine for me. I've never actually used the strategy of building a complete parse forest prior to semantic analysis but I've seen it done with proprietary GLR parsers. Tomorrow I'll post the rest of my sample grammar which handles the three-way asterisk problem using bison's GLR parser. – rici Dec 15 '16 at 06:49
  • Great, I'm very interested to see. It's precisely where my own hacking broke down last night... although I don't have C-like dereference prefix in this language, I still got conflicts between `type*` and `a*b` which I'm trying to get my head around. – Manu Evans Dec 15 '16 at 06:52
  • Yeah okay, I spent most of today trying to resolve the ambiguity between `type_postfix '*'` and `value_infix2 '*' value_infix1`. I'm very interested to hear about your solution for these awkward cases. I have a few of these emerged. – Manu Evans Dec 15 '16 at 16:23
  • @ManuEvans: OK, added. The grammar does actually work (modulo possible copy&paste issues) as seen by the example runs. I could show the actual code used to produce those outputs; it's pretty simple. But once again, I have to run. – rici Dec 15 '16 at 22:58
  • My language is compiling correctly! Thanks so much for all your help! – Manu Evans Dec 16 '16 at 14:13