How to handle same symbol used for two things lemon parser

Question

I'm developing a domain-specific language. Part of the language follows C expression parsing semantics exactly, including operator precedence and symbols.

I'm using the Lemon parser. I ran into an issue of the same token being used for two different things, and I can't tell the difference in the lexer. The ampersand (&) symbol is used for both 'bitwise and' and "address of".

At first I thought it was a trivial issue, until I realized that they don't have the same associativity.

How do I give the same token two different associativities? Should I just use AMP (as in ampersand) and make both the address-of and bitwise-and rules use AMP, or should I use different tokens (such as ADDRESSOF and BITWISE_AND)? If I do use separate tokens, how is the lexer supposed to know which one to emit (it can't know without being a parser itself!)?

+1 for compensating the pain since you'll have to write this by hand. — , Dec 28 '12 at 21:40
I'm not sure where to start though. Should I attempt to resolve it at the syntax tree level, or should I try to detect it in the parser (by peeking at the recent token stream, for example)? — doug65536, Dec 28 '12 at 21:56
In the parser. The AST must be unambiguous. The parser is what does the math & logic. — , Dec 28 '12 at 21:57
@H2CO3 You may have been thinking of how a true C parser has trouble because it has to know what is a type name. I resolved that by adding a cast keyword. I think I have (posted) the answer but I won't accept it until I have tested it. — doug65536, Dec 28 '12 at 23:24

rici · Accepted Answer · 2012-12-29T14:34:59.870

If you're going to write the rules out explicitly, using a different non-terminal for every "precedence" level, then you do not need to declare precedence at all, and you should not do so.

Lemon, like all yacc-derivatives, uses precedence declarations to remove ambiguities from ambiguous grammars. The particular ambiguous grammar referred to is this one:

expression: expression '+' expression
          | expression '*' expression
          | '&' expression
          | ... etc, etc.

In that case, every alternative leads to a shift-reduce conflict. If your parser generator didn't have precedence rules, or you wanted to be precise, you'd have to write that as an unambiguous grammar (which is what you've done):

term: ID | NUMBER | '(' expression ')' ;
postfix_expr:        term | term '[' expression ']' | ... ;
unary_expr:          postfix_expr | '&' unary_expr | '*' unary_expr | ... ;
multiplicative_expr: unary_expr | multiplicative_expr '*' unary_expr | ... ;
additive_expr:       multiplicative_expr | additive_expr '+' multiplicative_expr | ... ;
...
assignment_expr:     conditional_expr | unary_expr '=' assignment_expr | ...; 
expression:          assignment_expr ;
[1]

Note that the unambiguous grammar even encodes associativity: the left-associative operators (multiplicative and additive, above) recurse on the left, and the right-associative ones (assignment, although it's a bit weird; see below) recurse on the right. So there are really no ambiguities.
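As a rough illustration, the same layering could look like this in Lemon's own notation. This is only a sketch: the helper names Node_1_Op, Node_2_Op, Op_Mul and Op_Dereference are borrowed from the question author's own answer further down, Op_AddressOf and Op_BitAnd are hypothetical, and the levels between multiplicative and bitwise-and are left out.

```
// Sketch only: layered rules, with no %left/%right declarations at all.
// AMP and STAR are the single tokens the lexer emits for '&' and '*'.
// Op_AddressOf and Op_BitAnd are hypothetical names; the intermediate
// levels (additive, shift, relational, equality) are omitted here.

unary_expr(A) ::= postfix_expr(B).     { A = B; }
unary_expr(A) ::= AMP unary_expr(B).   { A = Node_1_Op(Op_AddressOf, B); }
unary_expr(A) ::= STAR unary_expr(B).  { A = Node_1_Op(Op_Dereference, B); }

multiplicative_expr(A) ::= unary_expr(B).  { A = B; }
multiplicative_expr(A) ::= multiplicative_expr(B) STAR unary_expr(C).
    { A = Node_2_Op(Op_Mul, B, C); }

and_expr(A) ::= equality_expr(B).  { A = B; }
and_expr(A) ::= and_expr(B) AMP equality_expr(C).
    { A = Node_2_Op(Op_BitAnd, B, C); }
```

The lexer only ever emits AMP and STAR. The parser can tell the two uses apart from its state: where an operand is expected, only the prefix rules can apply, and after a complete operand only the binary rules can.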

Now, the precedence declarations (%left, %right, etc.) are only used to disambiguate. If there are no ambiguities, the declarations are ignored. The parser generator does not even check that they reflect the grammar. (In fact, many grammars cannot be expressed as this kind of precedence relationship.)

Consequently, it's a really bad idea to include precedence declarations if the grammar is unambiguous. They might be completely wrong, and mislead anyone who reads the grammar. Changing them will not affect the way the language is parsed, which might mislead anyone who wants to edit the grammar.

There is at least some question about whether it's better to use an ambiguous grammar with precedence rules or to use an unambiguous grammar. In the case of C-like languages, whose grammar cannot be expressed with a simple precedence list, it's probably better to just use the unambiguous grammar. However, unambiguous grammars have a lot more states and may make parsing slightly slower, unless the parser generator is able to optimize away the unit reductions. These are all of the first alternatives in the above grammar, where each expression type might just be the previous expression type, with no effect on the AST. Each of these productions still needs to be reduced, although the reduction is mostly a no-op, and many parser generators will insert some code for it.

The reason C cannot simply be expressed as a precedence relationship is precisely the assignment operator. Consider:

a = 4 + b = c + 4;

This doesn't parse because, in assignment-expression, the left operand of the assignment operator must be a unary-expression. Neither possible relative precedence of + and = reproduces that behaviour. [2]

If + were of higher precedence than =, the expression would parse as:

a = ((4 + b) = (c + 4));

and if + were lower precedence, it would parse as

(a = 4) + (b = (c + 4));

[1] I just realized that I left out cast_expression but I can't be cast to put it back in; you get the idea.

[2] Description fixed.

It's been years since I worked on generated parsers (and it was flex/bison). Thanks, this is a great refresher and answer. — doug65536, Dec 29 '12 at 02:25
GCC rejects `a = 4 + b = c + 4;` with the error `lvalue required as left operand of assignment`. You'd have to introduce parentheses explicitly to get the assignment to `b` acceptable (minimally: `a = 4 + (b = c + 4);`). — Jonathan Leffler, Dec 29 '12 at 06:59
@JonathanLeffler: quite right (and the rules are different in c++). — rici, Dec 29 '12 at 14:37
Is it likely for the parser to be faster if the grammar is a flat list of rules with ambiguities resolved by the associativity declarations? I enabled tracing on my unambiguous grammar and I am seeing it go through several patterns before it resolves to the most common one. I was wondering if it could be worth rewriting it as an ambiguous grammar because it also significantly improves readability. If it also improves performance then I will definitely switch to ambiguous. My intuition says that the parser generator will have to have those states anyway. Will it? — doug65536, Jan 02 '13 at 19:45
@doug65536 iirc, lemon does not do unit rule elimination, so if you can collapse all those *-expression non-terminals into a single non-terminal you will save a bunch of unit reductions. I can't promise an observable speedup, but the parser does end up doing less. TIAS. — rici, Jan 02 '13 at 21:29
I've confirmed that the number of state transitions is drastically reduced by using precedence declarations rather than extra nonterminals to implement precedence. It also significantly improved the readability and consistency of the input grammar source code (everything is lined up and the different parts are consistently positioned). — doug65536, Jan 04 '13 at 19:22

doug65536 · Answer 2 · 2012-12-28T22:43:56.707

Later I realized I had the same ambiguity between dereference (*) and multiplication, also (*).

Lemon provides a way to assign a precedence to a rule: put the name of a token from the associativity declarations (%left/%right/%nonassoc) in square brackets after the rule's period.

I haven't verified that this works correctly yet, but I think you can do this (note the things in square brackets near the end):

.
.
.

%left COMMA.
%right QUESTION ASSIGN
    ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
    LSH_ASSIGN RSH_ASSIGN AND_ASSIGN XOR_ASSIGN OR_ASSIGN.
%left LOGICAL_OR.
%left LOGICAL_AND.
%left BITWISE_OR.
%left BITWISE_XOR.
%left BITWISE_AND.
%left EQ NE.
%left LT LE GT GE.
%left LSHIFT RSHIFT.
%left PLUS MINUS.
%left TIMES DIVIDE MOD.
//%left MEMBER_INDIRECT ->* .*
%right INCREMENT DECREMENT CALL INDEX DOT INDIRECT ADDRESSOF DEREFERENCE.

.
.
.

multiplicative_expr ::= cast_expr.
multiplicative_expr(A) ::= multiplicative_expr(B) STAR cast_expr(C). [TIMES]
    { A = Node_2_Op(Op_Mul, B, C); }
.
.
.
unary_expr(A) ::= STAR unary_expr(B). [DEREFERENCE]
    { A = Node_1_Op(Op_Dereference, B); }
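
If the layered non-terminals are later collapsed into a single expr non-terminal, one detail seems worth keeping in mind. As far as I understand Lemon's conflict resolution, it compares the precedence of the rule being reduced with the precedence of the lookahead token, and the lookahead is whatever the lexer actually emits (AMP or STAR here). So those tokens would themselves need a place in the precedence list, at the level of their binary meaning, and only the unary rules would need a bracketed override. This mirrors the usual yacc `%prec UMINUS` idiom. The following is an untested sketch; the single expr non-terminal and the names Op_BitAnd, Op_Add, Op_AddressOf, Node_Leaf, LPAREN, RPAREN and ID are assumptions for illustration.

```
// Sketch only: a flat (ambiguous) grammar, disambiguated by precedence.
// Lowest precedence first, as in the declarations above.
%left AMP.                      // binary '&' (bitwise and)
%left PLUS MINUS.
%left STAR DIVIDE MOD.          // binary '*' (multiplication)
%right ADDRESSOF DEREFERENCE.   // precedence names only; the lexer never emits them

// Binary rules take their precedence from AMP/PLUS/STAR by default.
expr(A) ::= expr(B) AMP expr(C).    { A = Node_2_Op(Op_BitAnd, B, C); }
expr(A) ::= expr(B) PLUS expr(C).   { A = Node_2_Op(Op_Add, B, C); }
expr(A) ::= expr(B) STAR expr(C).   { A = Node_2_Op(Op_Mul, B, C); }

// Unary rules override the token's precedence with the bracketed name.
expr(A) ::= AMP expr(B).  [ADDRESSOF]    { A = Node_1_Op(Op_AddressOf, B); }
expr(A) ::= STAR expr(B). [DEREFERENCE]  { A = Node_1_Op(Op_Dereference, B); }

expr(A) ::= LPAREN expr(B) RPAREN.  { A = B; }
expr(A) ::= ID(B).                  { A = Node_Leaf(B); }
```

With these declarations, an input such as `&a + b` should reduce the address-of rule before seeing PLUS, while `a & b * c` should shift STAR, because STAR outranks AMP.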
