flex&bison shift/reduce conflict

Question

Here are part of my grammar:

expr_address
: expr_address_category expr_opt    { $$ = new ExprAddress($1,*$2);}
| axis axis_data                    { $$ = new ExprAddress($1,*$2);}
;
axis_data
: expr_opt          { $$ = $1;}
| sign              { if($1 == MINUS)
                        $$ = new IntergerExpr(-1000000000);
                      else if($1 == PLUS)
                        $$ = new IntergerExpr(+1000000000);}
;
expr_opt
:               { $$ = new IntergerExpr(0);}
| expr          { $$ = $1;}
;
expr_address_category
: I                 { $$ = NCAddress_I;}
| J                 { $$ = NCAddress_J;}
| K                 { $$ = NCAddress_K;}
;
axis
: X                 { $$ = NCAddress_X;}
| Y                 { $$ = NCAddress_Y;}
| Z                 { $$ = NCAddress_Z;}
| U                 { $$ = NCAddress_U;}
| V                 { $$ = NCAddress_V;}
| W                 { $$ = NCAddress_W;}
;
expr
: '[' expr ']'              {$$ = $2;}
| COS parenthesized_expr    {$$ = new BuiltinMethodCallExpr(COS,*$2);}
| SIN parenthesized_expr    {$$ = new BuiltinMethodCallExpr(SIN,*$2);}
| ATAN parenthesized_expr   {$$ = new BuiltinMethodCallExpr(ATAN,*$2);}
| SQRT parenthesized_expr   {$$ = new BuiltinMethodCallExpr(SQRT,*$2);}
| ROUND parenthesized_expr  {$$ = new BuiltinMethodCallExpr(ROUND,*$2);}
| variable                  {$$ = $1;}
| literal                           
| expr '+' expr                 {$$ = new BinaryOperatorExpr(*$1,PLUS,*$3);}
| expr '-' expr                 {$$ = new BinaryOperatorExpr(*$1,MINUS,*$3);}
| expr '*' expr                 {$$ = new BinaryOperatorExpr(*$1,MUL,*$3);}
| expr '/' expr                 {$$ = new BinaryOperatorExpr(*$1,DIV,*$3);}
| sign expr %prec UMINUS        {$$ = new UnaryOperatorExpr($1,*$2);}
| expr EQ expr                  {$$ = new BinaryOperatorExpr(*$1,EQ,*$3);}
| expr NE expr                  {$$ = new BinaryOperatorExpr(*$1,NE,*$3);}
| expr GT expr                  {$$ = new BinaryOperatorExpr(*$1,GT,*$3);}
| expr GE expr                  {$$ = new BinaryOperatorExpr(*$1,GE,*$3);}
| expr LT expr                  {$$ = new BinaryOperatorExpr(*$1,LT,*$3);}
| expr LE expr                  {$$ = new BinaryOperatorExpr(*$1,LE,*$3);}
;
variable 
: d_h_address               {$$ = new AddressExpr(*$1);}
;
d_h_address
: D INTEGER_LITERAL     { $$ = new IntAddress(NCAddress_D,$2);}
| H INTEGER_LITERAL     { $$ = new IntAddress(NCAddress_H,$2);}
;

I hope my grammar support that like:

H100=20;
X;
X+0;
X+;
X+H100;   //means H100 variable ref

The top two are same with X0; By the way,sign -> +/-;

But bison report conflicts,the key part of bison.output:

State 108

11 expr: sign . expr
64 axis_data: sign .

INTEGER_LITERAL  shift, and go to state 93
REAL_LITERAL     shift, and go to state 94
'+'              shift, and go to state 74
'-'              shift, and go to state 75
COS              shift, and go to state 95
SIN              shift, and go to state 96
ATAN             shift, and go to state 97
SQRT             shift, and go to state 98
ROUND            shift, and go to state 99
D                shift, and go to state 35
H                shift, and go to state 36
'['              shift, and go to state 100

D         [reduce using rule 64 (axis_data)]
H         [reduce using rule 64 (axis_data)]
$default  reduce using rule 64 (axis_data)

State 69

62 expr_address: axis . axis_data

INTEGER_LITERAL  shift, and go to state 93
REAL_LITERAL     shift, and go to state 94
'+'              shift, and go to state 74
'-'              shift, and go to state 75
COS              shift, and go to state 95
SIN              shift, and go to state 96
ATAN             shift, and go to state 97
SQRT             shift, and go to state 98
ROUND            shift, and go to state 99
D                shift, and go to state 35
H                shift, and go to state 36
'['              shift, and go to state 100

D         [reduce using rule 65 (expr_opt)]
H         [reduce using rule 65 (expr_opt)]
$default  reduce using rule 65 (expr_opt)

State 68

61 expr_address: expr_address_category . expr_opt

INTEGER_LITERAL  shift, and go to state 93
REAL_LITERAL     shift, and go to state 94
'+'              shift, and go to state 74
'-'              shift, and go to state 75
COS              shift, and go to state 95
SIN              shift, and go to state 96
ATAN             shift, and go to state 97
SQRT             shift, and go to state 98
ROUND            shift, and go to state 99
D                shift, and go to state 35
H                shift, and go to state 36
'['              shift, and go to state 100

D         [reduce using rule 65 (expr_opt)]
H         [reduce using rule 65 (expr_opt)]
$default  reduce using rule 65 (expr_opt)

I don't know how to deal with this,thanks advance.

EDIT： I make a minimal grammar：

    %{
    #include <stdio.h>
    extern "C" int yylex();
    void yyerror(const char *s) { printf("ERROR: %s/n", s); }
%}

%token PLUS '+'  MINUS '-' 

%token D H I J K X Y Z INT

/*%type sign expr var expr_address_category expr_opt
%type axis */

%start word_list

%%
/*Above grammar lost this rule,it makes ambiguous*/
word_list
    : word
    | word_list word
    ;
sign
    : PLUS
    | MINUS
    ;
expr
    : var
    | sign expr
    | '[' expr ']'
    ;
var 
    : D INT
    | H INT
    ;
word
    : expr_address
    | var '=' expr
    ;
expr_address
    : expr_address_category expr_opt
    /*| '(' axis sign ')'*/
    | axis sign
    ;
expr_opt
    : /* empty */
    | expr
    ;
expr_address_category
    : I 
    | J
    | K
    | axis
    ;
axis
    : X
    | Y
    | Z
    ;
%%

and I hope it can support：

X;
X0;
X+0;  //the top three are same with X0
X+;
X+H100;  //this means X's data is ref +H100;
X+H100=10; //two word on a block,X+ and H100=10;
XH100=10;  //two word on a block,X and H100=10;

EDIT2:

The above EDIT lost this rule.

block
    : word_list ';' 
    | ';'
    ;

Because I have to allow such grammar:

H000 = 100 H001 = 200 H002 = 300;

I think we need more of the grammar, to see the context in which `expr_address` can appear. Also, I didn't understand how to apply "the top two are the same with `X0`" to `H100=20;` — rici, Mar 11 '17 at 15:11
I checked the fragment you provided; as I expected, it has no conflicts. In order to check it, I had to add declarations of various token types, provide a definition of `sign` and `parenthesized_expr` (I used `'(' expr ')'`), and eliminate the `literal` option from `expr` as I have no idea what that would be. There are various assumptions there so obviously if they are wrong, there might be conflicts as a result. Note that [ask] indicates the need for a [mcve]; it's worth taking the time to read those requests because they will help you ask questions which can be answered. — rici, Mar 11 '17 at 20:04
@rici I think the conflict like this: 'X+'H100 or 'X'+H100,whict means X+ reduce as 'axis axis_data',the other 'X' also can reduce as 'axis axis_data',it creates such as shift more one H or reduces as 'axis axis_data' conflict. — Shun, Mar 12 '17 at 09:46
@rici I am sorry,X0 and X+0 are same....I edit it and forget to change it. — Shun, Mar 12 '17 at 09:52
Your grammar -- the part you show -- does not allow an `expr_address` to be followed by an `expr` so interpreting `X+` to `axis axis_data` is not possible with a lookahead of `H`. As I said, the grammar fragment works fine on its own so the problem is an interaction with some rule you haven't shown us. — rici, Mar 13 '17 at 05:28
@rici I post a minimal grammar which report same conflict,can you help me one more? I am sorry for my poor English. — Shun, Mar 14 '17 at 02:01
You are quite right that the wordlist production is responsible for conflicts. All of your examples are written with semicolons. Is that not acceptable? — rici, Mar 14 '17 at 03:27
@rici yeah..it can not accept word with semicolons..see the last EDIT2. — Shun, Mar 14 '17 at 06:33

score 1 · Accepted Answer · edited May 23 '17 at 12:32

This is essentially the classic LR(2) grammar, except that in your case it is LR(3) because your variables consist of two tokens [Note 1]:

var : D INT | H INT

The basic problem is the concatenation of words without separators:

word_list : word | word_list word

combined with the fact that one of the options for word ends with an optional var:

word: expr_address
expr_address: expr_address_category expr_opt

while the other one starts with a var:

word: var '=' expr

The = makes this unambiguous, since nothing in an expr can contain that symbol. But at the point where a decision needs to be made, the = is not visible, because the lookahead is the first token of a var -- either an H or a D -- and the equals sign is still two tokens away.

This LR(2) grammar is very similar to the grammar used by yacc/bison itself, a fact which I always find to be ironic, because the grammar for yacc does not require ; between productions:

production: SYMBOL ':' | production SYMBOL  /* Lots of detail omitted */

As with your grammar, this makes it impossible to know whether a SYMBOL should be shifted or trigger a reduce because the disambiguating : is still not visible.

Since the grammar is (I assume) unambiguous, and bison can now generate GLR parsers, that will be the simplest solution: just add

%glr-parser

to your bison prologue (but read the section of the bison manual on GLR parsers to understand the trade-off).

Note that the shift-reduce conflicts will still be reported as warnings; since it is impossible to reliably decide whether a grammar is ambiguous, bison doesn't attempt to do so and ambiguities will be reported at run-time if they exist.

You should also fix the issue mentioned in @ChrisDodd's answer regarding the refactoring of expr_address (although with a GLR parser it is not strictly necessary).

If, for whatever reason, you feel that a GLR parser will not meet your needs, you could use the solution in most implementations of yacc (including bison), which is a hack in the lexical scanner. The basic idea is to mark whether a symbol is followed by a colon or not in the lexer, so that the above production could be rewritten as:

production: SYMBOL_COLON | production SYMBOL

This solution would work for you if you were willing to combine the letter and the number into a single token:

word: expr_address expr_opt
    | VARIABLE_EQUALS expr
// ...
expr: VARIABLE

My preference is to do this transformation in a wrapper around the lexer, which keeps a (one-element) queue of pending tokens:

/* The use of static variables makes this yylex wrapper unreliable
 * if it is reused after a syntax error.
 */
int yylex_wrapper() {
  static int saved_token = -1;
  static YYSTYPE saved_yylval = {0};

  int token = saved_token;
  saved_token = -1;
  yylval = saved_yylval;
  // Read a new token only if we don't have one in the queue.
  if (token < 0) token = yylex();
  // If the current token is IDENTIFIER, check the next token
  if (token == IDENTIFIER) {
    // Read the next token into the queue (saved_token / saved_yylval)
    YYSTYPE temp_val = yylval;
    saved_token = yylex();
    saved_yylval = yylval;
    yylval = temp_val;
    // If the second token is '=', then modify the current token
    // and delete the '=' from the queue
    if (saved_token == '=') {
        saved_token = -1;
        token = IDENTIFIER_EQUALS;
    }
  }
  return token;
}

Notes

Personally, I would start by making a var a single token (do you really want to allow people to write:
```
H /* Some comment in the middle of the variable name */ 100
```
but that's not going to solve any problems; it merely reduces the grammar's lookahead requirement from LR(3) to LR(2).

1.The customer just give me some example files and little constraints.If I can,I prefer making a var a single token such as `[DH][0-9]+`. Furthermore, I just parse source code file into `NCWord` sequence to back-end.`NCWord` is a `struct { NCAddress address; union{ double lfdata; int idata }code; }` and `NCAddress` is an `enum{ X,Y,Z,H,D,I,...}`. By the way, the interface between fore-end and back-end which I have design right. For now , it reachs what I expected. 2. — Shun, Mar 15 '17 at 02:49
As you said,the grammar is really ambiguous. And I can't make sure that my grammar is 100% true in GLR mode(I am the new for bison).The wrapper that you provide-- it is hard for me to use. As <> mentioned, any shift/reduce conflict can use precedence to solve.So,Has it any possible to solve the grammar conflict? — Shun, Mar 15 '17 at 03:13
Moreover, block-- `G00 X10Y10Z10;` and `G00;X10;Y10;Z10;` are not same means. the first block means X/Y/Z axis run 10 at same time and the other one means X run 10 then Y run 10 then Z run 10.So I have to design `;` as block separators. — Shun, Mar 15 '17 at 03:21
@Shun: I understand that `;` is a block separator. Nothing in this answer mentions `;` for that reason. — rici, Mar 15 '17 at 03:31
@Shun: And the grammar is not ambiguous (at least, I don't think it is ambiguous). It just isn't LR(1). But you cannot "solve" any shift/reduce conflict with precedence. You can force the parser to make a choice, but it may well be the wrong choice. I understand the problem with verifying GLR, but verifying precedence rules will be almost as challenging. You just need to have lots of tests cases. Finally, why is the wrapper hard for you to use? All you need to do is tell bison to use the wrapper instead of `yylex`. The simplest (not the cleanest) way to do that is a macro in the bison file. — rici, Mar 15 '17 at 03:35
Yeah..so,you give me two choice that `GLR` and `yylexwrap`or hack in flex?`GLR` need lots of tests case to make sure my grammar absolutely correct? I know `#define YYLEX yylex_wrapper()` , but as above comment, I have to tell back-end with `var` such as `H100` into a `NCWord` which `address = H` and 'code.idata = 100'. So I can't change `var` to a single token like `ID`(maybe..),then if I use wrapper which I have to keep two token in queue? — Shun, Mar 15 '17 at 06:49
You could use a three element queue but it would be easier to recognise a var as one token whose semantic value is a NCWord. — rici, Mar 15 '17 at 07:10
This my lex:`"D"{return D;}"H"{return H;} [0-9]+ {yylval = atoi(yytext);return INT;}` , if recognise a var as one token `[DH][0-9]+ {yylval.NCWord = ?? ;return ID;}`, what is about the yylval.NCWord value.To be honest, if `GLR `has not other side-effect,I will use`GLR` since it is simple for me.. — Shun, Mar 15 '17 at 08:32
I think a GLR parser will be fine. For the combined var, you would have `D[0-9]+ { yylval.ncword = (NCWord){ .address = D_ENUM, .code.idata = atoi(yytext+1)}; return ID; }` and similar for `H`. (Assumes modern C, otherwise you need two assgnments to fill in the NCWord fields.) — rici, Mar 15 '17 at 13:38

score 0 · Answer 2 · answered Mar 14 '17 at 03:01

The main problem is that it can't figure out where one word in a word_list ends and the next one begins, because there is no separator token between words. This is in contrast to your examples, which all have ; terminators. So that suggests one obvious fix -- put in the ; separators:

word: expr_address ';'
    | var '=' expr ';'

That fixes most of the problems, but leaves a lookahead conflict where it can't decide whether an axis is an expr_address_category or not when the lookahead is a sign, because it depends on whether there's an expr after the sign or not. You can fix that by refactoring to defer deciding:

expr_address
    : expr_address_category expr_opt
    | axis expr_opt
    | axis sign

..and remove axis from expr_address_category

You are quite right...But I am sorry for my fault that I can't put a word with ';' separators.See the last EDIT2. — Shun, Mar 14 '17 at 06:32

flex&bison shift/reduce conflict

2 Answers2

Notes

Linked