The problem is that a function_def can only occur after a function_list, which means that the parser needs to reduce an empty function_list (using the production function_list → ε) before it can recognize a function_def. Furthermore, it needs to make that decision by only looking at the token which follows the empty production. Since that token (a type_name) could start either a var_decl or a function_def, there is no way for the parser to decide.
Even leaving the decision for one more token won't help; it's not until the third token that the correct decision can be made. So your grammar is not ambiguous, but it is LR(3).
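For reference, the shape being described presumably looks something like the following. This is a reconstruction from the description above, not the grammar from the question; the %token line and the empty param_list and func_body stubs are assumptions added only so that the file is complete.

```yacc
/* Reconstruction of the conflicting shape (assumed, not the original grammar). */
%token TKINT TKFLOAT TKCHAR NAME

%%

program       : global_list function_list ;

/* Both lists may be empty; the empty function_list is the troublemaker. */
global_list   : /* empty */
              | global_list var_decl
              ;
function_list : /* empty */
              | function_list function_def
              ;

type_name     : TKINT | TKFLOAT | TKCHAR ;
var_decl      : type_name NAME ;
function_def  : type_name NAME '(' param_list ')' '{' func_body '}' ;

/* Hypothetical stubs, only here so the file is self-contained. */
param_list    : /* empty */ ;
func_body     : /* empty */ ;
```

After the global_list, a type token such as TKINT can either be shifted (starting another var_decl) or trigger the reduction function_list → ε (starting the first function_def), so bison should report shift/reduce conflicts on the type tokens.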
Sequences of possibly empty lists of different types create this problem whenever elements of the two lists can start with the same token. By contrast, sequences of non-empty lists do not, so a first approach to solving the problem is to eliminate the ε-productions.
First, we expand the top-level definition to make it clear that both lists are optional:
program: global_list function_list
| global_list
| function_list
| /* empty */
;
Then we make both list types non-empty:
global_list
: var_decl
| global_list var_decl
;
function_list
: function_def
| function_list function_def
;
The rest of the grammar is unchanged.
type_name : TKINT /* int */
| TKFLOAT /* float */
| TKCHAR /* char */
;
var_decl : type_name NAME;
function_def : type_name NAME '(' param_list ')' '{' func_body '}' ;
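Putting the pieces together gives a file that can be run through bison to check for conflicts. This is only a sketch: the %token line and the empty param_list and func_body stubs are assumptions added so that the file is complete, and var_decl is copied as written above.

```yacc
/* Sketch of the ε-free version; stubs and %token line are assumptions. */
%token TKINT TKFLOAT TKCHAR NAME

%%

/* Each optional list is handled here instead of inside the list itself. */
program
    : global_list function_list
    | global_list
    | function_list
    | /* empty */
    ;

/* Non-empty, left-recursive lists. */
global_list
    : var_decl
    | global_list var_decl
    ;

function_list
    : function_def
    | function_list function_def
    ;

type_name
    : TKINT   /* int */
    | TKFLOAT /* float */
    | TKCHAR  /* char */
    ;

var_decl
    : type_name NAME
    ;

function_def
    : type_name NAME '(' param_list ')' '{' func_body '}'
    ;

/* Hypothetical stubs, only here so the file is self-contained. */
param_list : /* empty */ ;
func_body  : /* empty */ ;
```

With no empty list to reduce, the parser simply shifts the type and the NAME and decides between var_decl and function_def afterwards, so bison should accept this without reporting any conflicts.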
It's worth noting that the problem would never have arisen if declarations could be interspersed. Is it really necessary that all global variables be defined before any function? If not, you could just use a single list type, which would also be conflict free:
program: decl_list ;
decl_list: /* empty */
| decl_list var_decl
| decl_list function_def
;
Both these solutions work because a bottom-up parser can wait until the end of the production being reduced in order to decide which is the correct reduction; it does not matter that var_decl and function_def look identical until the third token.
The problem really is that it's hard to figure out the type of nothing.