
I am implementing a parser for a subset of Java using Java CUP.

The grammar looks like this:

vardecl ::= type ID
type    ::= ID | INT | FLOAT | ...
exp     ::= ID | exp LBRACKET exp RBRACKET | ...
stmt    ::= ID ASSIGN exp SEMI

This works fine, but when I add

stmt ::= ID ASSIGN exp SEMI
        |ID LBRACKET exp RBRACKET ASSIGN exp SEMI 

CUP no longer works. The warnings are:

Warning : *** Shift/Reduce conflict found in state #122
  between exp ::= identifier (*) 
  and     statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI 
  under symbol LBRACKET
  Resolved in favor of shifting.

Warning : *** Reduce/Reduce conflict found in state #42
  between type ::= identifier (*) 
  and     exp ::= identifier (*) 
  under symbols: {}
  Resolved in favor of the first production.

Warning : *** Shift/Reduce conflict found in state #42
  between type ::= identifier (*) 
  and     statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI 
  under symbol LBRACKET
  Resolved in favor of shifting.

Warning : *** Shift/Reduce conflict found in state #42
  between exp ::= identifier (*) 
  and     statement ::= identifier (*) LBRACKET exp RBRACKET ASSIGN exp SEMI 
  under symbol LBRACKET
  Resolved in favor of shifting.

I think there are two problems:
1. type ::= ID and exp ::= ID: when the parser sees an ID, it wants to reduce it, but it doesn't know whether to reduce it to type or to exp.

2. stmt ::= ID LBRACKET exp RBRACKET ASSIGN exp SEMI is for assigning to an array element, such as arr[key] = value;
   exp ::= exp LBRACKET exp RBRACKET is for reading an array element, such as arr[key]

So in the case arr[key], when the parser sees arr, it knows that it is an ID, but it doesn't know if it should shift or reduce to exp.

However, I have no idea how to fix this. Any advice would be appreciated, thanks a lot.

Jiahao Cai

1 Answer


Your analysis is correct. The grammar is LR(2) because a declaration cannot be identified until the ] token is seen, and that is the second token after the ID that might be a type.

One simple solution is to hack the lexer to return [] as a single token when the brackets appear as consecutive tokens. (The lexer should probably allow whitespace between the brackets, too, so it's not quite trivial but it's not complicated.) If a [ is not immediately followed by a ], the lexer will return it as an ordinary [. That makes it easy for the parser to distinguish between assignment to an array (which will have a [ token) and declaration of an array (which will have a [] token).
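The lexer hack described above could look roughly like the following hand-written sketch. It is not CUP or JFlex code: the class name BracketLexer, the token kind LRBRACKET, and the tiny token set are all assumptions made for illustration, and a real lexer would return CUP sym constants instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer; only the handling of '[' matters here.
class BracketLexer {
    // Assumed token kinds. LRBRACKET stands for the combined "[]" token.
    enum Kind { ID, LBRACKET, RBRACKET, LRBRACKET, ASSIGN, SEMI }

    static final class Token {
        final Kind kind;
        final String text;
        Token(Kind kind, String text) { this.kind = kind; this.text = text; }
    }

    static List<Token> tokenize(String src) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetter(c) || c == '_') {
                int start = i;
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) {
                    i++;
                }
                out.add(new Token(Kind.ID, src.substring(start, i)));
            } else if (c == '[') {
                // Look ahead past whitespace; if the next character is ']',
                // emit one combined token instead of two separate brackets.
                int j = i + 1;
                while (j < src.length() && Character.isWhitespace(src.charAt(j))) j++;
                if (j < src.length() && src.charAt(j) == ']') {
                    out.add(new Token(Kind.LRBRACKET, "[]"));
                    i = j + 1;
                } else {
                    out.add(new Token(Kind.LBRACKET, "["));
                    i++;
                }
            } else if (c == ']') {
                out.add(new Token(Kind.RBRACKET, "]"));
                i++;
            } else if (c == '=') {
                out.add(new Token(Kind.ASSIGN, "="));
                i++;
            } else if (c == ';') {
                out.add(new Token(Kind.SEMI, ";"));
                i++;
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String src : new String[] { "arr [ ] x;", "arr[key] = value;" }) {
            List<Kind> kinds = new ArrayList<>();
            for (Token t : tokenize(src)) kinds.add(t.kind);
            System.out.println(src + " -> " + kinds);
        }
    }
}
```

With this sketch, arr [ ] x; yields ID LRBRACKET ID SEMI, while arr[key] = value; still yields a plain LBRACKET after the first ID, so one token of lookahead is enough for the parser to tell the two apart.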

It's also possible to rewrite the grammar, but that's a real nuisance.

The second problem is array-indexing assignment versus array-indexed expressions. Normally programming languages allow assignment of the form:

exp [ exp ] = exp

and not just ID [ exp ]. Making this change will delay the need to reduce until late enough for the parser to identify the correct reduction. Depending on the language, it's possible that this syntax is not semantically meaningful but checking that is in the realm of type checking (semantics) not syntax. If there is some syntax of that form which is meaningful, however, there is no obvious reason to prohibit it.
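To illustrate pushing the check into semantics, here is a minimal sketch of what a later pass might do once the parser accepts exp [ exp ] = exp. Everything in it is hypothetical: the AST classes (Var, IntLit, Index), the string-based types, and the checkIndexAssign helper are assumptions for this example, not part of CUP or of the grammar in the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical semantic check for a statement parsed as: exp [ exp ] = exp ;
class IndexAssignCheck {
    // Assumed AST node types; a real parser would build these in CUP actions.
    interface Exp {}

    static final class Var implements Exp {
        final String name;
        Var(String name) { this.name = name; }
    }

    static final class IntLit implements Exp {
        final int value;
        IntLit(int value) { this.value = value; }
    }

    static final class Index implements Exp {
        final Exp base;
        final Exp index;
        Index(Exp base, Exp index) { this.base = base; this.index = index; }
    }

    // Types are plain strings ("int", "int[]", "int[][]") to keep the sketch short.
    static String typeOf(Exp e, Map<String, String> env) {
        if (e instanceof IntLit) return "int";
        if (e instanceof Var) {
            String name = ((Var) e).name;
            String t = env.get(name);
            if (t == null) throw new IllegalStateException("undeclared variable: " + name);
            return t;
        }
        Index ix = (Index) e;
        String baseType = typeOf(ix.base, env);
        if (!baseType.endsWith("[]")) {
            throw new IllegalStateException("cannot index a value of type " + baseType);
        }
        if (!typeOf(ix.index, env).equals("int")) {
            throw new IllegalStateException("array index must be int");
        }
        // Indexing strips one array dimension from the base type.
        return baseType.substring(0, baseType.length() - 2);
    }

    // The grammar accepts any expression as the base; this pass rejects
    // the ones that make no sense, such as indexing an int.
    static void checkIndexAssign(Exp base, Exp index, Exp value, Map<String, String> env) {
        String elemType = typeOf(new Index(base, index), env);
        String valueType = typeOf(value, env);
        if (!elemType.equals(valueType)) {
            throw new IllegalStateException("cannot assign " + valueType + " to " + elemType);
        }
    }

    public static void main(String[] args) {
        Map<String, String> env = new HashMap<>();
        env.put("arr", "int[]");
        env.put("n", "int");
        checkIndexAssign(new Var("arr"), new IntLit(0), new IntLit(1), env); // accepted
        try {
            checkIndexAssign(new Var("n"), new IntLit(0), new IntLit(1), env);
        } catch (IllegalStateException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

The point of the design is that the grammar stays permissive and conflict-free, while an assignment such as n[0] = 1 with n declared as int is rejected after parsing rather than by the parser.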

Some parser generators implement GLR parsers. A GLR parser would have no problem with this grammar because it is not ambiguous. But CUP isn't such a generator.

rici
  • Thanks for your answer. But I think it cannot return `[]` as a single token in my case, because there is an expression `key` between `[` and `]` in the case `arr[key] = value`. – Jiahao Cai Dec 08 '18 at 16:42
  • @jiahao: You only return it as a single token if there is nothing but whitespace between the `[` and the `]`. `arr [ key ]` is still four tokens, but `arr [ ]` is just two. – rici Dec 08 '18 at 17:22
  • Yes, but the problem is that the parser still cannot tell whether `arr[key]` is part of an `exp` or a `stmt`. If `arr[key]` is part of `arr[key] = value`, then it is a `stmt`, and `arr` should be shifted as an `ID`; if `arr[key]` is part of `arr[key].init()`, then `arr` should be reduced to `exp`. Sorry, I think I forgot to add a rule, which is `stmt ::= exp DOT ID LPAREN parameter_list RPAREN SEMI`; this rule is for function calls. – Jiahao Cai Dec 08 '18 at 18:51
  • @jiahao: that's a different problem and it has a different solution. The simplest one is to allow assignments of the form `exp [ exp ] =`, which is valid in most languages. – rici Dec 08 '18 at 19:58
  • Yes I agree, thank you so much for your answers, you are so helpful ;) – Jiahao Cai Dec 08 '18 at 20:41