I am writing a grammar for a complete programming language of my own design. This language has several types of expressions that are combined in different ways in different situations. I have a pretty good idea of how I want it to work, but I am having trouble eliminating the shift/reduce and reduce/reduce conflicts. I am using Bison v3.0.4 under Xubuntu 16.04. The full grammar (including the *.output file) can be seen in my GitHub repository at https://github.com/chucktilbury/Simple1 (see expressions.y and expressions.output).

I have gotten pretty far with it. I know it's not the best, but I am learning. If someone could give some pointers to help me get unstuck, I would appreciate it.

Here is a snip of the part of the grammar that is giving me problems:

%{
#include <stdio.h>
%}

%token OPAREN_TOK CPAREN_TOK OCURLY_TOK CCURLY_TOK OBOX_TOK CBOX_TOK
%token COMMA_TOK SCOLON_TOK DOT_TOK COLON_TOK 
%token CLASS_TOK FUNC_TOK PRIVATE_TOK PUBLIC_TOK PROTECTED_TOK
%token CREATE_TOK DESTROY_TOK IMPORT_TOK STRUCT_TOK

%token PLUS_TOK MINUS_TOK MULT_TOK DIV_TOK MODULO_TOK ASSIGN_TOK 

%token BIT_NOT_TOK BIT_OR_TOK BIT_AND_TOK BIT_XOR_TOK BIT_LSH_TOK BIT_RSH_TOK

%token INT_TOK FLOAT_TOK UNSD_TOK STRG_TOK
%token BOOL_TOK 

%token RETURN_TOK BREAK_TOK CONT_TOK IF_TOK ELSE_TOK WHILE_TOK
%token FOR_TOK SWITCH_TOK CASE_TOK 

%token OR_TOK AND_TOK NOT_TOK EQ_TOK GEQ_TOK LEQ_TOK
%token NEQ_TOK MORE_TOK LESS_TOK 

%token TRUE_TOK FALSE_TOK NOTHING_TOK

%token SYMBOL_TOK UNSIGNED_TOK INTEGER_TOK FLOATING_TOK STRING_TOK

%left MINUS_TOK PLUS_TOK
%left MULT_TOK DIV_TOK
%left NEGATION
%right CARAT_TOK    /* exponentiation        */

%%

expression
    : arithmetic_expression
    | boolean_expression
    | bitwise_expression
    ;

compound_symbol
    : SYMBOL_TOK
    | compound_symbol DOT_TOK SYMBOL_TOK
    ;

exponent_numeric_value
    : FLOATING_TOK
    | INTEGER_TOK
    ;

arithmetic_factor
    : INTEGER_TOK
    | FLOAT_TOK
    | UNSIGNED_TOK
    | exponent_numeric_value CARAT_TOK exponent_numeric_value
    | compound_symbol
    ;

arithmetic_expression
    : arithmetic_factor
    | arithmetic_expression PLUS_TOK arithmetic_expression
    | arithmetic_expression MINUS_TOK arithmetic_expression
    | arithmetic_expression MULT_TOK arithmetic_expression
    | arithmetic_expression DIV_TOK arithmetic_expression
    | MINUS_TOK arithmetic_expression %prec NEGATION
    | OPAREN_TOK arithmetic_expression CPAREN_TOK
    ;

boolean_factor
    : arithmetic_factor
    | TRUE_TOK
    | FALSE_TOK
    | STRING_TOK
    ;

boolean_expression
    : boolean_factor
    | boolean_expression OR_TOK boolean_expression
    | boolean_expression AND_TOK boolean_expression 
    | boolean_expression EQ_TOK boolean_expression 
    | boolean_expression NEQ_TOK boolean_expression 
    | boolean_expression LEQ_TOK boolean_expression 
    | boolean_expression GEQ_TOK boolean_expression 
    | boolean_expression MORE_TOK boolean_expression 
    | boolean_expression LESS_TOK boolean_expression 
    | NOT_TOK boolean_expression 
    | OPAREN_TOK boolean_expression CPAREN_TOK
    ;

bitwise_factor
    : INTEGER_TOK
    | UNSIGNED_TOK
    | compound_symbol
    ;

bitwise_expression
    : bitwise_factor
    | bitwise_expression BIT_AND_TOK bitwise_expression
    | bitwise_expression BIT_OR_TOK bitwise_expression
    | bitwise_expression BIT_XOR_TOK bitwise_expression
    | bitwise_expression BIT_LSH_TOK bitwise_expression
    | bitwise_expression BIT_RSH_TOK bitwise_expression
    | BIT_NOT_TOK bitwise_expression
    | OPAREN_TOK bitwise_expression CPAREN_TOK 
    ; 
%% 

This yields 102 shift-reduce and 8 reduce-reduce conflicts. I get that I have some of the tokens reused in rules and the root non-terminal is contrived. I am having trouble figuring out how to organize them so that the correct (sometimes the same) types are associated with the correct type of expression. I have tried reorganizing in various ways. I think it's clear that I am missing something. Maybe my whole approach is all wrong, but I am unclear what the correct approach would be for this.

For a better (but very incomplete) explanation of what I am really trying to do, see the readme on this repository: https://github.com/chucktilbury/toi

  • Well, I figured out part of the answer. I added the %nonassoc keyword for the boolean and bitwise operators. I am pretty sure that's what I want. Still have the 8 reduce-reduce conflicts. – Chuck Tilbury Sep 11 '18 at 04:46
  • Why do you feel the need to have three different expression types? If it is to try to embed type checking in the grammar, that's basically a lost cause. Do a type analysis pass on the AST (possibly as you're building it, but the code is usually clearer if you do it after the parse). If it is because you think you need to, you don't. Just organize your operators by precedence, either with precedence declarations or by using one non-terminal per precedence level. – rici Sep 11 '18 at 07:44
  • Interesting. The reason that I have 3 separate expression definitions is that I want them not to mix. I want the expressions to be robust, such as if('this' == 'that') but I don't want to allow things like if(2+2 > 8). It would have to be defined as x=2+2; if(x>8). Does that seem reasonable? I think that to be a little more clear and not very burdensome. What do you think? – Chuck Tilbury Sep 11 '18 at 21:15
  • Honestly, I would hate it. Naming things is a chore and working out a good name for a temporary value I will use exactly once strikes me as a massive waste of my intellectual resources. Have you really never written, e.g., `if (x % 2 == 1) ...`? Is that really clearer written as `xModulo2 = x % 2; if (xModulo2 == 1)...`? And if you think it is clearer, why do you allow `if (x > 2 and x < n)` instead of insisting on `xIsBigEnough = x > 2; xIsNotTooBig = x < n; if (xIsBigEnough and xIsNotTooBig)...`. Anyway, it's your language; I don't have to like it. But you asked... – rici Sep 11 '18 at 22:36
  • Anyway, getting back to the point. If you want to say that the operand of a comparison must be a syntactic primitive, that's easily doable, regardless of my opinion about its utility. But trying to syntactically restrict the arguments of a comparison operator to be the same type is a lost cause, because type analysis is not syntactic. The parser should focus on syntax; it's not required to and should not attempt to flag semantic errors. Otherwise you risk falling into the Cobol trap, where you cannot even parse some expressions without knowing the types of the variables used. – rici Sep 11 '18 at 22:43
  • Great point. I was planning to do semantic checking anyhow. It seemed reasonable to try to restrict some of that using the syntax. So it seems like you are saying that an expression is just an expression syntactically and things like type compatibility checking ought to be done outside of the parser. All of those operators really do the same thing from a syntax point of view. Do I read you correctly? I can see where one might hate having to build a complex expression using individual statements, too. Thanks for that feedback. – Chuck Tilbury Sep 11 '18 at 23:48
  • Yup, that's exactly what I meant. Good luck with the project. – rici Sep 11 '18 at 23:50
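
The single-expression approach discussed in the comments above might look roughly like the sketch below. It reuses the token names from the question, but the non-terminal `primary` and the precedence ordering (modeled on C) are assumptions made for illustration, not something taken from the original grammar. Type compatibility is deliberately not checked here; that would be left to a later pass over the AST.

```yacc
/* Sketch only: a single expression non-terminal for all operator kinds.
   The precedence order (lowest first) is an assumption modeled on C. */
%token SYMBOL_TOK INTEGER_TOK UNSIGNED_TOK FLOATING_TOK STRING_TOK
%token TRUE_TOK FALSE_TOK OPAREN_TOK CPAREN_TOK DOT_TOK

%left OR_TOK
%left AND_TOK
%left BIT_OR_TOK
%left BIT_XOR_TOK
%left BIT_AND_TOK
%left EQ_TOK NEQ_TOK
%left LESS_TOK MORE_TOK LEQ_TOK GEQ_TOK
%left BIT_LSH_TOK BIT_RSH_TOK
%left PLUS_TOK MINUS_TOK
%left MULT_TOK DIV_TOK
%right NOT_TOK BIT_NOT_TOK NEGATION   /* unary operators */
%right CARAT_TOK                      /* exponentiation  */

%%

expression
    : primary
    | expression OR_TOK expression
    | expression AND_TOK expression
    | expression BIT_OR_TOK expression
    | expression BIT_XOR_TOK expression
    | expression BIT_AND_TOK expression
    | expression EQ_TOK expression
    | expression NEQ_TOK expression
    | expression LESS_TOK expression
    | expression MORE_TOK expression
    | expression LEQ_TOK expression
    | expression GEQ_TOK expression
    | expression BIT_LSH_TOK expression
    | expression BIT_RSH_TOK expression
    | expression PLUS_TOK expression
    | expression MINUS_TOK expression
    | expression MULT_TOK expression
    | expression DIV_TOK expression
    | expression CARAT_TOK expression
    | MINUS_TOK expression %prec NEGATION
    | NOT_TOK expression
    | BIT_NOT_TOK expression
    ;

/* "primary" is a hypothetical name for the atoms of an expression. */
primary
    : INTEGER_TOK
    | UNSIGNED_TOK
    | FLOATING_TOK
    | STRING_TOK
    | TRUE_TOK
    | FALSE_TOK
    | compound_symbol
    | OPAREN_TOK expression CPAREN_TOK
    ;

compound_symbol
    : SYMBOL_TOK
    | compound_symbol DOT_TOK SYMBOL_TOK
    ;

%%
```

Because every operator has a declared precedence and associativity, bison can resolve the ambiguity between competing parses of something like a + b * c on its own. Putting parentheses in only one place (primary) also means a parenthesized operand never has to be classified before the surrounding operator is seen.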

1 Answer

If you haven't already, run bison in a form like bison -r all filename.y, and look at the extra output file filename.output. Near the top, this gives me

State 9 conflicts: 2 reduce/reduce
State 10 conflicts: 2 reduce/reduce
State 14 conflicts: 2 reduce/reduce
State 16 conflicts: 2 reduce/reduce
State 35 conflicts: 5 shift/reduce
State 38 conflicts: 8 shift/reduce
...

The next instance of 'State 9' is

State 9

   10 arithmetic_factor: UNSIGNED_TOK .  [$end, CPAREN_TOK, PLUS_TOK, MINUS_TOK, MULT_TOK, DIV_TOK, OR_TOK, AND_TOK, EQ_TOK, GEQ_TOK, LEQ_TOK, NEQ_TOK, MORE_TOK, LESS_TOK]
   36 bitwise_factor: UNSIGNED_TOK .  [$end, CPAREN_TOK, BIT_OR_TOK, BIT_AND_TOK, BIT_XOR_TOK, BIT_LSH_TOK, BIT_RSH_TOK]

    $end         reduce using rule 10 (arithmetic_factor)
    $end         [reduce using rule 36 (bitwise_factor)]
    CPAREN_TOK   reduce using rule 10 (arithmetic_factor)
    CPAREN_TOK   [reduce using rule 36 (bitwise_factor)]
    BIT_OR_TOK   reduce using rule 36 (bitwise_factor)
    BIT_AND_TOK  reduce using rule 36 (bitwise_factor)
    BIT_XOR_TOK  reduce using rule 36 (bitwise_factor)
    BIT_LSH_TOK  reduce using rule 36 (bitwise_factor)
    BIT_RSH_TOK  reduce using rule 36 (bitwise_factor)
    $default     reduce using rule 10 (arithmetic_factor)

This output represents one possible "state" in the state machine implementing the parsing algorithm.

First there are a number of lines showing a possible position within a grammar rule. A period (.) always shows the current position within the rule. When the period is at the very end of a rule, bison might follow that with a list in [square brackets] of all the terminal tokens which might be valid immediately after the non-terminal symbol resulting from the rule.

Next there is a table of the actions the parser will take given the current State and the next token in the input stream. Conflicts show up as multiple entries for the same token, with actions after the first in [square brackets] (to indicate that the action would be valid given the ambiguous grammar, but the parser will never actually take that action).

So in the State 9 output, we can see the problem: when an UNSIGNED_TOK token is followed by the end of parser input or by the CPAREN_TOK token, bison can't determine whether the number should be reduced to an arithmetic_factor or a bitwise_factor. For end of input, perhaps this doesn't matter much, and the issue could be avoided by fiddling with the root non-terminal. But the closing parenthesis case is a real problem. Since bison (by default) generates an LALR(1) parser, after processing the first two tokens in the text ( 0u ), the parser needs to decide what to do with 0u using only the single lookahead token ). If it decides to make it an arithmetic_factor and the input turns out to be ( 0u ) & 1u, it's wrong; if it decides to make it a bitwise_factor and the input turns out to be ( 0u ) + 1u, it's also wrong.
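
To see the same effect in isolation, here is a stripped-down grammar that I would expect to show the same kind of reduce/reduce conflict. It is only a sketch: the token names NUM, PLUS, AMP, LP and RP and the rule names are invented for this illustration and do not come from the grammar in the question.

```yacc
/* Hypothetical minimal grammar: NUM can be reduced two different ways,
   and the lookahead RP (or end of input) does not say which one is right. */
%token NUM PLUS AMP LP RP

%%

start
    : arith
    | bits
    ;

arith
    : NUM                 /* competes with bits: NUM below */
    | arith PLUS NUM
    | LP arith RP
    ;

bits
    : NUM                 /* competes with arith: NUM above */
    | bits AMP NUM
    | LP bits RP
    ;

%%
```

Running bison -r all on a file like this should list both NUM rules in a single state, with RP and $end appearing in both lookahead sets, which mirrors the State 9 output above.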

To fix problems like this, it's often helpful to think of the grammar rules in terms of semantic actions (even in cases where a grammar is being used just to determine whether input is valid or not and there won't be any semantic actions). What action should an interpreter take for the expression ( 0u )? Ideally, none at all: the expression should have the same representation and effect as just plain 0u. This puts it in a different category from both compound arithmetic expressions and compound bitwise expressions, since those have more limited uses (at least in the grammar shown).

But if we want to say for example ( 0u ) is NOT an arithmetic_expression, it seems we might be going toward a ridiculous number of rules for arithmetic_expression to list the cross-product of all the acceptable operand types. We can avoid this by using a rule for an arithmetic_operand, which the parser will use only for a subexpression of an actual arithmetic operator (not including parentheses). To allow multiple operators, any arithmetic_expression can also be used as an arithmetic_operand.

So here's a version of your grammar (after the same token declarations) without the reduce-reduce conflicts:

%%

expression
    : int_constant
    | float_constant
    | bool_constant
    | string_constant
    | exponent_constant
    | symbol_parens
    | arithmetic_expression
    | boolean_expression
    | bitwise_expression
    ;

compound_symbol
    : SYMBOL_TOK
    | compound_symbol DOT_TOK SYMBOL_TOK
    ;

symbol_parens
    : compound_symbol
    | OPAREN_TOK symbol_parens CPAREN_TOK
    ;

int_constant
    : INTEGER_TOK
    | UNSIGNED_TOK
    | OPAREN_TOK int_constant CPAREN_TOK
    ;

float_constant
    : FLOAT_TOK
    | OPAREN_TOK float_constant CPAREN_TOK
    ;

bool_constant
    : TRUE_TOK
    | FALSE_TOK
    | OPAREN_TOK bool_constant CPAREN_TOK
    ;

string_constant
    : STRING_TOK
    | OPAREN_TOK string_constant CPAREN_TOK
    ;

exponent_operand
    : FLOATING_TOK
    | INTEGER_TOK
    ;

exponent_constant
    : exponent_operand CARAT_TOK exponent_operand
    | OPAREN_TOK exponent_constant CPAREN_TOK
    ;

arithmetic_operand
    : int_constant
    | float_constant
    | exponent_constant
    | symbol_parens
    | arithmetic_expression
    ;

arithmetic_expression
    : arithmetic_operand PLUS_TOK arithmetic_operand
    | arithmetic_operand MINUS_TOK arithmetic_operand
    | arithmetic_operand MULT_TOK arithmetic_operand
    | arithmetic_operand DIV_TOK arithmetic_operand
    | MINUS_TOK arithmetic_operand %prec NEGATION
    | OPAREN_TOK arithmetic_expression CPAREN_TOK
    ;

boolean_operand
    : bool_constant
    | int_constant
    | float_constant
    | exponent_constant
    | string_constant
    | symbol_parens
    | boolean_expression
    ;

boolean_expression
    : boolean_operand OR_TOK boolean_operand
    | boolean_operand AND_TOK boolean_operand 
    | boolean_operand EQ_TOK boolean_operand 
    | boolean_operand NEQ_TOK boolean_operand 
    | boolean_operand LEQ_TOK boolean_operand 
    | boolean_operand GEQ_TOK boolean_operand 
    | boolean_operand MORE_TOK boolean_operand 
    | boolean_operand LESS_TOK boolean_operand 
    | NOT_TOK boolean_operand 
    | OPAREN_TOK boolean_expression CPAREN_TOK
    ;

bitwise_operand
    : int_constant
    | symbol_parens
    | bitwise_expression
    ;

bitwise_expression
    : bitwise_operand BIT_AND_TOK bitwise_operand
    | bitwise_operand BIT_OR_TOK bitwise_operand
    | bitwise_operand BIT_XOR_TOK bitwise_operand
    | bitwise_operand BIT_LSH_TOK bitwise_operand
    | bitwise_operand BIT_RSH_TOK bitwise_operand
    | BIT_NOT_TOK bitwise_operand
    | OPAREN_TOK bitwise_expression CPAREN_TOK 
    ; 
%% 

The 102 shift-reduce conflicts are still there, but they could all be resolved by specifying precedence and associativity for the operators in the boolean_expression and bitwise_expression rules.
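
As a rough sketch of what those declarations might look like, the lines below could be added next to the existing %left lines for the arithmetic operators. The relative ordering is an assumption loosely modeled on C, and I have not run it against the full grammar, so treat it as a starting point rather than a verified fix.

```yacc
/* Boolean operators, lowest precedence first (ordering is an assumption). */
%left OR_TOK
%left AND_TOK
%left EQ_TOK NEQ_TOK
%left LESS_TOK MORE_TOK LEQ_TOK GEQ_TOK
%right NOT_TOK        /* unary, binds tighter than the binary boolean operators */

/* Bitwise operators, lowest precedence first (ordering is an assumption). */
%left BIT_OR_TOK
%left BIT_XOR_TOK
%left BIT_AND_TOK
%left BIT_LSH_TOK BIT_RSH_TOK
%right BIT_NOT_TOK    /* unary, binds tighter than the binary bitwise operators */
```

Using %nonassoc instead of %left for the comparison operators would make chained comparisons such as a < b < c a syntax error, which may or may not be what you want.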

One note, though: Possibly this was unintentional, but your grammar doesn't allow mixing operators from different "categories". So for example, inputs (1 + 2 < 4) and (5 & 6 == 4) are not valid.

aschepler
  • Thanks for the great answer. I did "solve" the shift-reduce conflicts by making all of those operators %left. Making them %nonassoc solves the shift-reduce issues, too, but I think I want them to associate left, like C does. Also, I do intend that boolean and, for example, arithmetic expressions do not mix. For example x=2+4; if(x >= 10) {#do this#} The central idea of this language is to be simple to use. In fact I am calling it "simple". I am combining features that I like from Python and C to make a "c-like" language that can express an entire application with no whitespace at all. – Chuck Tilbury Sep 11 '18 at 21:07
  • I would be very interested in your opinion on separating the expression types. – Chuck Tilbury Sep 11 '18 at 21:19