The basic issue here is the pair of productions shown in State 11 (I used the command-line option --report=all instead of -v in order to see the lookahead computation):
22 postfix_value_expression: unknown_expression . [';', '(', ')', ',']
29 postfix_type_expression: unknown_expression . [';', ')', '*']
Here you can see that the computed lookahead for both postfix_value_expression and postfix_type_expression includes ; and ).
The first of these is (somewhat) spurious; it is one of the rare cases in which the LALR algorithm conflates two states which actually have different lookaheads. (This issue is briefly discussed in the bison manual and at greater length in most standard parsing textbooks.) If we ask bison to use the Canonical LR algorithm (%define lr.type canonical-lr), this conflict disappears, but the second one remains (in state 52):
22 postfix_value_expression: unknown_expression . ['(', ')']
29 postfix_type_expression: unknown_expression . [')', '*']
What we have here is a true ambiguity in your grammar. (I'm assuming that "types" can have "value" members (and vice-versa), as the grammar seems to indicate.)
Suppose the input includes
x.member
The parser has no need to know whether x is a type or a value. x will simply be parsed as an unknown_expression (production 13) and the index selection will also be an unknown_expression (production 14). [Note: I copied the numbered production rules from the output file; they appear at the end of this answer.]
But since both values and types can be parenthesized (productions 18 and 25), the above could legitimately be written:
(x).member
Now the parser needs to apply either production 15 or 16 in order to derive unknown_expression, which means that it needs to reduce x using production 22 (to proceed with production 15) or production 29 (for production 16). It does not, however, know whether x is a type or a value, and it cannot know (at least, without consulting the symbol table). Although the two possible reductions are (presumably) equivalent, they are different productions, so the grammar is ambiguous. Hence the reduce/reduce conflict.
Looked at this way, the conflict is the result of what we might call premature categorization. Since the grammar doesn't actually care whether the left-hand operand of the index operator . is a type or a value, the left-hand operand does not need to be resolved at this point. That will be worked out by semantic analysis in some subsequent step. That being the case, the simplest solution is to not bother with distinguishing between types and values in the grammar, which will work out just fine for this simple language subset. A very simple grammar would suffice:
expression : postfix_expression
postfix_expression: term
| postfix_expression '.' IDENTIFIER
| postfix_expression '(' expression_list ')'
| postfix_expression '*'
term : IDENTIFIER
| '(' expression ')'
and any necessary semantic checks (to ensure that the expression is the right kind) could be done in a posterior tree walk, once all the identifiers have been categorized.
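To make the point concrete, here is a minimal hand-written sketch of a recursive-descent parser for that simple grammar, in Python rather than bison. The function name parse_postfix and the tuple-shaped AST are my own inventions for illustration, and I am assuming that expression_list is a possibly empty comma-separated list, since the grammar above does not define it.

```python
import re

def parse_postfix(text):
    """Parse one expression of the simplified grammar into nested tuples."""
    toks = re.findall(r"[A-Za-z_]\w*|[().,*]", text)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take(expected=None):
        nonlocal pos
        tok = peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected or 'a token'!r}, got {tok!r}")
        pos += 1
        return tok

    def identifier():
        tok = take()
        if not re.fullmatch(r"[A-Za-z_]\w*", tok):
            raise SyntaxError(f"expected an identifier, got {tok!r}")
        return tok

    def term():
        # term : IDENTIFIER | '(' expression ')'
        if peek() == "(":
            take("(")
            inner = expression()
            take(")")
            return inner  # parentheses only group; they assign no category
        return ("id", identifier())

    def expression():
        # postfix_expression: a term followed by any number of postfix operators
        node = term()
        while peek() in (".", "(", "*"):
            op = take()
            if op == ".":
                node = ("select", node, identifier())
            elif op == "*":
                node = ("pointer", node)
            else:
                # Assumed shape of expression_list: comma-separated, possibly empty
                args = []
                if peek() != ")":
                    args.append(expression())
                    while peek() == ",":
                        take(",")
                        args.append(expression())
                take(")")
                node = ("call", node, args)
        return node

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected trailing token {peek()!r}")
    return tree
```

With this sketch, x.member and (x).member produce the same tree, and at no point does the parser ask whether x is a type or a value.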
It is useful to ask whether it is ever necessary to grammatically resolve the category of an expression. Is it actually required to disambiguate the parse?
In C (and friends), the language really is ambiguous unless identifiers are categorized. For example, (x)*y is either:
- a cast of the dereferenced pointer y, or
- the product of the scalars x and y
A correct parse tree cannot be produced without knowing the category of x. (To make this even clearer, consider z/(x)*y, where the shape of the parse tree is radically different between the two alternatives. There are many other examples.)
The usual solution for producing an accurate parse tree for C is to resort to lexical feedback, which in turn requires declaration before use, at least to the extent of categorization. (Even C++ insists that type aliases be declared before use, even though it allows class members to be used before declaration.)
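The effect of lexical feedback on those two examples can be sketched in a few lines of Python. This is a toy, not a C parser: it handles only identifiers, parentheses, casts, prefix * and the infix operators * and /, and the function name parse_c_like and its type_names parameter (standing in for the symbol table the lexer would consult) are hypothetical.

```python
import re

def parse_c_like(text, type_names=frozenset()):
    """Parse a tiny C-like expression; type_names plays the role of lexical feedback."""
    toks = re.findall(r"[A-Za-z_]\w*|[()*/]", text)
    pos = 0

    def peek(offset=0):
        i = pos + offset
        return toks[i] if i < len(toks) else None

    def advance():
        nonlocal pos
        if pos >= len(toks):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return toks[pos - 1]

    def primary():
        tok = advance()
        if tok == "(":
            node = multiplicative()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in (")", "*", "/"):
            raise SyntaxError(f"unexpected {tok!r}")
        return ("id", tok)

    def unary():
        if peek() == "*":
            advance()
            return ("deref", unary())
        # Lexical feedback: '(' NAME ')' is a cast only if NAME is a known type.
        if peek() == "(" and peek(1) in type_names and peek(2) == ")":
            advance()
            name = advance()
            advance()
            return ("cast", name, unary())
        return primary()

    def multiplicative():
        # Left-associative chain of '*' and '/'
        node = unary()
        while peek() in ("*", "/"):
            op = "mul" if advance() == "*" else "div"
            node = (op, node, unary())
        return node

    tree = multiplicative()
    if pos != len(toks):
        raise SyntaxError("trailing input")
    return tree
```

Calling parse_c_like("z/(x)*y") gives a product whose left operand is the quotient z/x, while parse_c_like("z/(x)*y", {"x"}) gives a quotient whose right operand is a cast of *y. The same token sequence yields two differently shaped trees, depending only on what the symbol table says about x.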
A reasonable alternative is to use something like a GLR parser, which produces both (or all) parse trees. (This is usually called a parse forest.) A subsequent semantic analysis pass could prune the parse forest once identifiers have been categorized, although it is still necessary that the categorization is unambiguous for all parse trees in the forest. [Note 1]
Avoiding the above ambiguity is mostly a question of language design; had the syntax for C casts used square brackets instead of parentheses, for example, the problem could have been eliminated. (Of course, that would have used up the syntactic space which C++ used for lambdas.) Nonetheless, it is easy to imagine a language whose grammar is unambiguous even without knowing the categories of each identifier, but which nonetheless requires consideration of expression domains because of syntactic divergence. It seems likely that your language will be of this form. (In some sense, this is top-down categorization rather than bottom-up categorization: a declaration and an expression statement have different parsing contexts.)
For example, it is easy to imagine that you anticipate using * in a similar manner to the C family, as both a prefix dereference operator and infix multiplication operator. In combination with the postfix type construction operator shown above, that ends up with a single operator being used for prefix, postfix and infix. If all the uses are tossed indiscriminately into the same bag, that is necessarily ambiguous: a single token can represent at most two of operand, infix operator, prefix operator and postfix operator. If * could be any of the three operator types, then a * * b could mean either (a*) * b or a * (*b).
But that expression is still possible to parse unambiguously even without knowing the category of identifiers, because we know that neither the prefix nor the infix * operators can be applied to the result of the postfix * type constructor. Consequently, (a*) * b is impossible, and the only valid parse is a * (*b).
Unfortunately, since an LR(1) parser needs to decide whether or not to reduce on the basis of a single lookahead token, the decision as to whether a* should be reduced to a type or not needs to be made when the second * is encountered, which is not possible because (a * *) is a valid type expression. And worse, there is no limit to the number of asterisks in the expression, which might actually be a*…*[b]. We can disambiguate based on the token following the last asterisk, but that requires arbitrary lookahead.
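The following Python sketch shows what "disambiguate based on the token following the last asterisk" amounts to. It is only an illustration of the lookahead problem, not part of any parser: the function name classify_star_run is hypothetical, and I am assuming that the tokens which can start an operand are identifiers, numbers and '('.

```python
def classify_star_run(tokens, start):
    """Classify each '*' in the run beginning at tokens[start].

    The run is assumed to follow a complete operand. The answer depends on
    the first token after the whole run, however long the run is.
    """
    end = start
    while end < len(tokens) and tokens[end] == "*":
        end += 1  # skip the run; its length is unbounded
    count = end - start
    if count == 0:
        return []
    following = tokens[end] if end < len(tokens) else None
    # Assumption: an operand starts with an identifier, a number or '('.
    starts_operand = following is not None and (
        following == "(" or following[0].isalnum() or following[0] == "_"
    )
    if starts_operand:
        # One infix product, then prefix dereferences of the right operand.
        return ["infix"] + ["prefix"] * (count - 1)
    # Nothing to multiply by: every asterisk is a postfix type constructor.
    return ["postfix"] * count
```

The while loop is the whole problem: it has to run past every asterisk before the first one can be classified, and an LR(1) parser gets to look at only one token.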
Given the above, the best solution is probably a GLR parser, which doesn't have a lookahead limitation, combined with enough syntax to at least divide the contexts in which operators like * could be used.
We can't just throw a GLR parser at the grammar as presented in the OP, because -- as mentioned above -- it is definitely ambiguous even though the ambiguity is unimportant. Fortunately, we can create an unambiguous grammar by avoiding premature categorization.
The problem with parenthesized expressions, as shown above, is that the grammar requires them to be either type or value expressions, even in cases where it doesn't care which. To avoid that, we need to allow the possibility of delaying the decision until and if it is necessary by creating a third type of parenthesized expression. That results in the following grammar, which is actually LALR(1).
I separated out the expression productions because otherwise the boilerplate is overwhelming. Each production effectively indicates the expected category of the arguments, the grammatical precedence of the operator, and the category of the result (or unknown). The left-hand side of each production is known_category_precedence, asserting that the grammar requires that category, or unknown_precedence, indicating that the grammar by itself does not predict a category. (For example, '(expr).b' might be a type or a value, so the select operator's result is unknown_postfix.)
On the right-hand side, you will see non-terminals of the form category_precedence, which assert that the semantic analysis must result in an object with that category. These non-terminals are really just for convenience; each one is defined simply as
category_precedence: known_category_precedence | unknown_precedence
in which the second alternative indicates the need for a semantic check. If that were possible during the parse, it would be inserted in the action for the second alternative. Alternatively, a semantic check node could be inserted into the AST.
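As a rough idea of what such a check node might do in a later pass, here is a Python sketch. It assumes check nodes named ASVALUE and ASTYPE (the labels which appear in the sample output later in this answer), an AST made of nested tuples with bare strings for identifiers, and a symbol table which is just a dict from identifier to "type" or "value". All of that is my own simplification, and the select operator is left out because it would need member lookup.

```python
# Result category of each operator node handled by this sketch.
RESULT_CATEGORY = {"POINTER": "type", "DEREF": "value", "PROD": "value"}

def category(node, symbols):
    """Return 'type' or 'value' for an AST node; raise TypeError on a mismatch."""
    if isinstance(node, str):
        # Bare identifier: only the symbol table knows its category.
        return symbols[node]
    kind, *children = node
    if kind in ("ASVALUE", "ASTYPE"):
        # Check node: the grammar demanded a category it could not verify.
        wanted = "value" if kind == "ASVALUE" else "type"
        found = category(children[0], symbols)
        if found != wanted:
            raise TypeError(f"expected a {wanted}, found a {found}")
        return wanted
    result = RESULT_CATEGORY[kind]
    for child in children:
        category(child, symbols)  # run any checks nested in the operands
    return result
```

For instance, with symbols = {"a": "value", "T": "type"}, the tree ("POINTER", ("ASTYPE", "T")) is accepted as a type, while ("POINTER", ("ASTYPE", "a")) raises TypeError because a is a value.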
I included these convenience productions in the boiler-plate but most of them are commented out because bison complains when a non-terminal is not used anywhere in the grammar.
%token VAR "var"
%token IDENTIFIER TYPE NUMBER
%%
program : %empty
| program statement
statement : var_declaration
| expr_statement
var_declaration : "var" IDENTIFIER ':' type ';'
expr_statement : value ';'
/* Productions for expression syntaxes */
parameters : '(' ')'
| '(' value_list ')'
value_list : value
| value_list ',' value
known_value_postfix: value_postfix parameters { /* function call */ }
unknown_postfix : any_postfix '.' IDENTIFIER { /* select */ }
known_type_postfix : type_postfix '*' { /* pointer type constructor */ }
/* Primary terms */
unknown_primary : IDENTIFIER
| '(' unknown ')'
known_value_primary: NUMBER
| '(' known_value ')'
known_type_primary : TYPE
| '(' known_type ')'
/* Boilerplate precedence grammar with two infix precedence levels */
unknown_postfix : unknown_primary
known_value_postfix: known_value_primary
known_type_postfix : known_type_primary
value_postfix : known_value_postfix | unknown_postfix
type_postfix : known_type_postfix | unknown_postfix
any_postfix : known_value_postfix | known_type_postfix | unknown_postfix
unknown_prefix : unknown_postfix
known_value_prefix : known_value_postfix
known_type_prefix : known_type_postfix
/* value_prefix : known_value_prefix | unknown_prefix */
/* type_prefix : known_type_prefix | unknown_prefix */
unknown_infix9 : unknown_prefix
known_value_infix9 : known_value_prefix
known_type_infix9 : known_type_prefix
/* value_infix9 : known_value_infix9 | unknown_infix9 */
/* type_infix9 : known_type_infix9 | unknown_infix9 */
unknown_infix8 : unknown_infix9
known_value_infix8 : known_value_infix9
known_type_infix8 : known_type_infix9
/* value_infix8 : known_value_infix8 | unknown_infix8 */
/* type_infix8 : known_type_infix8 | unknown_infix8 */
/* The last stanza is mostly for convenience but it also serves
* to avoid the spurious reduce/reduce conflict on ';'.
*/
unknown : unknown_infix8
known_value : known_value_infix8
known_type : known_type_infix8
value : known_value | unknown
type : known_type | unknown
With that framework in place, we can start adding other types of expression productions.
First, let's confront the three-way conflict on *. The basic expression productions are quite simple, based on the pattern described above:
known_value_prefix : '*' value_prefix { /* dereference */ }
known_value_infix9 : value_infix9 '*' value_prefix { /* product */ }
(We also need to uncomment the value_prefix and value_infix9 productions.)
Although the language is still unambiguous, it is no longer LALR(1) (or even LR(k) for any k), as indicated above in the discussion of this syntax. So bison will complain that there is a reduce-reduce conflict. We can't fix that complaint, but we can easily produce a working parser; all we need to do is insert a request that bison generate a GLR parser:
%glr-parser
That doesn't suppress the conflict warning [Note 2], but it does produce a working parser. (bison cannot verify that the grammar is unambiguous because there is no precise algorithm for doing so.) Since we cannot prove the grammar is unambiguous, we need to do extensive testing. If the GLR parser encounters an ambiguity, it will produce an error message, so we can be reassured when we don't see any error:
$ ./typevar3
(*a).b;
[EXP [ASVALUE [SELECT [DEREF [ASVALUE a]] b]]]
a*b;
[EXP [PROD [ASVALUE a] [ASVALUE b]]]
a**b;
[EXP [PROD [ASVALUE a] [DEREF [ASVALUE b]]]]
(a***b).c;
[EXP [ASVALUE [SELECT [PROD [ASVALUE a] [DEREF [DEREF [ASVALUE b]]]] c]]]
(a***).c;
[EXP [ASVALUE [SELECT [POINTER [POINTER [POINTER [ASTYPE a]]]] c]]]
Now, let's suppose we want to add an array type constructor, somewhat similar to C. We'll allow the form:
type[count] // eg. int[3]
Note that the form is syntactically similar to an array index operation (a[2]), which we'll also add to the language.
This is a slightly different case from the select operator. In the case of the select operator, the grammar permits either values or types to be selected from, and cannot predict the result. In the case of array constructing/indexing, the category of the result is precisely the category of the first argument.
Because we need to "pass through" the category of the first argument, we need three productions, rather like the parenthesis productions:
known_value_postfix: known_value_postfix '[' value ']'
known_type_postfix : known_type_postfix '[' value ']'
unknown_postfix : unknown_postfix '[' value ']'
In the third case, we'll need to insert a "construct_or_index" node into the AST. That could be resolved into a construct node or an index node by a later unit production which converts an unknown category into some specific category, or it could be left for the semantic analysis phase.
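If the resolution is left for semantic analysis, it could look something like this Python sketch. It assumes the unresolved node is labelled INDEX_OR_ARRAY and resolves to INDEX or ARRAY, with VALUE and TYPE leaves (labels borrowed from the sample output in this answer). The tuple representation, the dict symbol table and the function name resolve are hypothetical.

```python
def resolve(node, symbols):
    """Return (category, rewritten_node), replacing INDEX_OR_ARRAY nodes."""
    if isinstance(node, str):
        # Bare identifier: category comes from the symbol table.
        return symbols[node], node
    kind, *children = node
    if kind == "VALUE":
        return "value", node
    if kind == "TYPE":
        return "type", node
    if kind == "INDEX_OR_ARRAY":
        base_category, base = resolve(children[0], symbols)
        count_category, count = resolve(children[1], symbols)
        if count_category != "value":
            raise TypeError("array size or index must be a value")
        # The result has the same category as the first argument.
        label = "ARRAY" if base_category == "type" else "INDEX"
        return base_category, (label, base, count)
    raise ValueError(f"unhandled node kind {kind!r}")
```

So ("INDEX_OR_ARRAY", "a", ("VALUE", 2)) becomes an INDEX node if a is a value and an ARRAY node if a is a type, which is the "pass through" behaviour described above.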
Adding these three productions produces no problems. However, now we would like to add a syntax for constructing variable-sized arrays, and we will choose the syntax int[*], generating yet another incompatible use of the * lexeme. That syntax has a definite result category, since it is not a valid index expression, so we can write the production:
known_type_postfix : type_postfix '[' '*' ']'
This time, we chose to use a convenience non-terminal on the right-hand side. Of course, that produces a raft of new grammatical conflicts, both shift-reduce and reduce-reduce, but the grammar continues to work as desired:
var a: (int*)[*];
[DECL a [ARRAY [POINTER [TYPE int]]]]
var a: (int*)[2*2];
[DECL a [ARRAY [POINTER [TYPE int]] [PROD [VALUE 2] [VALUE 2]]]]
a[2];
[EXP [ASVALUE [INDEX_OR_ARRAY a [VALUE 2]]]]
(*a)[2];
[EXP [INDEX [DEREF [ASVALUE a]] [VALUE 2]]]
Left as an exercise: use the infix + operator to represent the scalar addition of two values, and the union of two types. Note that you will have to deal with the cases where just one of the two operands has a known category (and therefore the other operand must be coerced into the same category), as well as the case where neither operand has a known category.
Notes
1. The GLR parser implemented by bison is not ideal for this purpose because it wants to produce a single parse tree. It is possible to manually create a parse forest with the bison-generated parser by using %merge declarations (as described in the manual), but this does not implement the space-efficient graph representation which other GLR parsers can produce.
2. The bison manual suggests using the %expect-rr directive to suppress the conflict warning, but IMHO that should only be done once the grammar is production-ready. Unfortunately, you can only suppress a known count of conflicts, rather than suppressing particular expected conflicts; I suppose this is because it is difficult to precisely describe a single expected conflict, but the end result is that suppressing conflicts makes it easy to miss issues while you are developing the grammar. Verifying that the conflicts are the expected ones is annoying, but less annoying than missing an error in the grammar.
The original grammar with numbered productions:
1 root: code_statements
2 code_statements: code_statement
3 | code_statements code_statement
4 code_statement: var_statement
5 | expression_statement
6 var_decl: IDENTIFIER ':' type_expression
7 var_statement: VAR var_decl ';'
8 expression_statement: value_expression ';'
9 function_call: '(' ')'
10 | '(' value_expression_list ')'
11 value_expression_list: value_expression
12 | value_expression_list ',' value_expression
13 unknown_expression: IDENTIFIER
14 | unknown_expression '.' IDENTIFIER
15 | postfix_known_value_expression '.' IDENTIFIER
16 | postfix_known_type_expression '.' IDENTIFIER
17 primary_value_expression: NUMBER
18 | '(' value_expression ')'
19 postfix_known_value_expression: primary_value_expression
20 | postfix_value_expression function_call
21 postfix_value_expression: postfix_known_value_expression
22 | unknown_expression
23 value_expression: postfix_value_expression
24 primary_type_expression: INT
25 | '(' type_expression ')'
26 postfix_known_type_expression: primary_type_expression
27 | postfix_type_expression '*'
28 postfix_type_expression: postfix_known_type_expression
29 | unknown_expression
30 type_expression: postfix_type_expression