Is it possible to write a recursive-descent parser for this grammar?

Question

From this question, a grammar for expressions involving binary operators (+ - * /) which disallows outer parentheses:

top_level   : expression PLUS term
            | expression MINUS term
            | term TIMES factor
            | term DIVIDE factor
            | NUMBER
expression  : expression PLUS term
            | expression MINUS term
            | term
term        : term TIMES factor
            | term DIVIDE factor
            | factor
factor      : NUMBER
            | LPAREN expression RPAREN

This grammar is LALR(1). I have therefore been able to use PLY (a Python implementation of yacc) to create a bottom-up parser for the grammar.

For comparison purposes, I would now like to try building a top-down recursive-descent parser for the same language. I have transformed the grammar, removing left-recursion and applying left-factoring:

top_level   : expression top_level1
            | term top_level2
            | NUMBER
top_level1  : PLUS term
            | MINUS term
top_level2  : TIMES factor
            | DIVIDE factor
expression  : term expression1
expression1 : PLUS term expression1
            | MINUS term expression1
            | empty
term        : factor term1
term1       : TIMES factor term1
            | DIVIDE factor term1
            | empty
factor      : NUMBER
            | LPAREN expression RPAREN

Without the top_level rules this grammar is LL(1), so writing a recursive-descent parser would be fairly straightforward. Unfortunately, including top_level, the grammar is not LL(1).

Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?
Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)
Is it possible to simplify this grammar to ease the recursive-descent approach?

rici · Accepted Answer · 2016-03-21T15:21:06.700

The grammar is not LL with finite lookahead, but the language is LL(1) because an LL(1) grammar exists. Pragmatically, a recursive descent parser is easy to write even without modifying the grammar.

Is there an "LL" classification for this grammar (e.g. LL(k), LL(*))?

If α is a derivation of expression, β of term and γ of factor, then top_level can derive both the sentence (α)+β and the sentence (α)*γ (but it cannot derive the sentence (α) on its own). However, (α) is a possible derivation of both expression and term, so it is impossible to decide which production of top_level to use until the symbol following the ) is encountered. Since α can be of arbitrary length, there is no k for which a lookahead of k is sufficient to distinguish the two productions. Some people might call that LL(∞), but that doesn't seem to be a very useful grammatical category to me. (LL(*) is, afaik, the name of a parsing strategy invented by Terence Parr, and not an accepted name for a class of grammars.) I would simply say that the grammar is not LL(k) for any k.
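To make the lookahead argument concrete, here is a small Python illustration (not part of the original answer). The helper names `sentence_pair` and `shared_prefix_length` are made up for this sketch, and it assumes single-digit numbers so that every character is exactly one token. For α consisting of n ones joined by +, the two sentences agree on their first 2n+1 tokens and differ only in the token after the closing parenthesis.

```python
def sentence_pair(n):
    """Return two valid top_level sentences that share the prefix (1+1+...+1).

    The first needs `expression PLUS term`, the second `term TIMES factor`.
    """
    alpha = "+".join(["1"] * n)          # alpha contains n ones, n >= 1
    prefix = "(" + alpha + ")"
    return prefix + "+1", prefix + "*1"


def shared_prefix_length(a, b):
    """Count the leading tokens (here: characters) on which a and b agree."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


for n in (1, 5, 50):
    plus_form, times_form = sentence_pair(n)
    # The shared prefix grows as 2n + 1, so no fixed k is ever enough.
    print(n, shared_prefix_length(plus_form, times_form))
```

Because n can be chosen freely, any fixed lookahead k is defeated by picking n large enough.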

Is it possible to write a recursive-descent parser for this grammar? How would that be done? (Is backtracking required?)

Sure. It's not even that difficult.

The first symbol must be either a NUMBER or a (. If it is a NUMBER, we predict (call) expression. If it is (, we consume it, call expression, consume the following ) (or declare an error if the next symbol is not a close parenthesis), and then, depending on the next symbol, call either expression1 alone or term1 followed by expression1. Again, if the next symbol doesn't match the FIRST set of either expression1 or term1, we declare a syntax error. Note that the above strategy does not require the top_level* productions at all.

Since that will clearly work without backtracking, it can serve as the basis for how to write an LL(1) grammar.
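The strategy described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the answerer's code: the tokenizer, the `Parser` class, the `parse` helper and the tuple-based tree shape are all invented for the sketch, NUMBER is assumed to be an unsigned integer, and the tail-recursive expression1 and term1 rules are written as loops that take the already-parsed left operand.

```python
import re


def tokenize(text):
    """Split text into (kind, value) pairs, ending with an EOF marker."""
    tokens = []
    for lexeme in re.findall(r"[0-9]+|\S", text):
        if lexeme[0] in "0123456789":
            tokens.append(("NUMBER", int(lexeme)))
        elif lexeme in "+-*/()":
            tokens.append((lexeme, lexeme))
        else:
            raise SyntaxError(f"unexpected character {lexeme!r}")
    tokens.append(("EOF", None))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0]

    def advance(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        return self.advance()

    def top_level(self):
        if self.peek() == "NUMBER":
            tree = self.expression()
        elif self.peek() == "(":
            self.advance()
            inner = self.expression()
            self.expect(")")
            if self.peek() in ("*", "/"):      # FIRST(term1)
                tree = self.expression1(self.term1(inner))
            elif self.peek() in ("+", "-"):    # FIRST(expression1)
                tree = self.expression1(inner)
            else:
                raise SyntaxError("outer parentheses are not allowed")
        else:
            raise SyntaxError(f"unexpected {self.peek()}")
        self.expect("EOF")
        return tree

    def expression(self):
        return self.expression1(self.term())

    def expression1(self, left):
        while self.peek() in ("+", "-"):
            op = self.advance()[0]
            left = (op, left, self.term())
        return left

    def term(self):
        return self.term1(self.factor())

    def term1(self, left):
        while self.peek() in ("*", "/"):
            op = self.advance()[0]
            left = (op, left, self.factor())
        return left

    def factor(self):
        if self.peek() == "NUMBER":
            return self.advance()[1]
        self.expect("(")
        inner = self.expression()
        self.expect(")")
        return inner


def parse(text):
    return Parser(text).top_level()


print(parse("(1+2)*3-4"))   # ('-', ('*', ('+', 1, 2), 3), 4)
```

Each decision is made by looking at a single token, so no backtracking is needed; `parse("(1+2)")` raises `SyntaxError` because nothing in FIRST(expression1) or FIRST(term1) follows the closing parenthesis.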

Is it possible to simplify this grammar to ease the recursive-descent approach?

I'm not sure if the following grammar is any simpler, but it does correspond to the recursive descent parser described above.

top_level   : NUMBER optional_expression_or_term_1
            | LPAREN expression RPAREN expression_or_term_1
optional_expression_or_term_1: empty
            | expression_or_term_1
expression_or_term_1
            : PLUS term expression1
            | MINUS term expression1
            | TIMES factor term1 expression1
            | DIVIDE factor term1 expression1
expression  : term expression1
expression1 : PLUS term expression1
            | MINUS term expression1
            | empty
term        : factor term1
term1       : TIMES factor term1
            | DIVIDE factor term1
            | empty
factor      : NUMBER
            | LPAREN expression RPAREN
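One way to check a grammar like this mechanically is to compute FIRST and FOLLOW sets and look for alternatives of the same nonterminal whose predict sets overlap. The Python sketch below is an illustration only: the dictionary encoding, the function names and the `$` end marker are assumptions made here. It applies the textbook LL(1) condition to both the left-factored grammar from the question and the grammar above.

```python
# Rules shared by the question's left-factored grammar and the grammar above.
BASE = {
    "expression": [("term", "expression1")],
    "expression1": [("PLUS", "term", "expression1"),
                    ("MINUS", "term", "expression1"), ()],
    "term": [("factor", "term1")],
    "term1": [("TIMES", "factor", "term1"),
              ("DIVIDE", "factor", "term1"), ()],
    "factor": [("NUMBER",), ("LPAREN", "expression", "RPAREN")],
}
QUESTION = dict(
    BASE,
    top_level=[("expression", "top_level1"), ("term", "top_level2"), ("NUMBER",)],
    top_level1=[("PLUS", "term"), ("MINUS", "term")],
    top_level2=[("TIMES", "factor"), ("DIVIDE", "factor")],
)
ANSWER = dict(
    BASE,
    top_level=[("NUMBER", "optional_expression_or_term_1"),
               ("LPAREN", "expression", "RPAREN", "expression_or_term_1")],
    optional_expression_or_term_1=[(), ("expression_or_term_1",)],
    expression_or_term_1=[("PLUS", "term", "expression1"),
                          ("MINUS", "term", "expression1"),
                          ("TIMES", "factor", "term1", "expression1"),
                          ("DIVIDE", "factor", "term1", "expression1")],
)


def first_of(seq, first, nullable):
    """Return (FIRST set of the symbol sequence, whether it can derive empty)."""
    result = set()
    for sym in seq:
        if sym not in first:              # terminal symbol
            result.add(sym)
            return result, False
        result |= first[sym]
        if sym not in nullable:
            return result, False
    return result, True


def analyse(grammar, start):
    """Iterate to a fixed point for the nullable, FIRST and FOLLOW sets."""
    nullable = set()
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                f, eps = first_of(rhs, first, nullable)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
                if eps and nt not in nullable:
                    nullable.add(nt)
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in grammar:
                        f, eps = first_of(rhs[i + 1:], first, nullable)
                        new = f | (follow[nt] if eps else set())
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
    return first, follow, nullable


def ll1_conflicts(grammar, start):
    """List (nonterminal, token) pairs predicted by more than one alternative."""
    first, follow, nullable = analyse(grammar, start)
    conflicts = set()
    for nt, alternatives in grammar.items():
        seen = set()
        for rhs in alternatives:
            f, eps = first_of(rhs, first, nullable)
            predict = f | (follow[nt] if eps else set())
            conflicts |= {(nt, tok) for tok in predict & seen}
            seen |= predict
    return sorted(conflicts)


print(ll1_conflicts(ANSWER, "top_level"))     # []
print(ll1_conflicts(QUESTION, "top_level"))
```

For the question's grammar the conflicts show up in top_level, expression1 and term1; for the grammar above the list is empty.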

I'm left with two observations, both of which you are completely free to ignore (particularly the second one which is 100% opinion).

The first is that it seems odd to me to ban (1+2) but allow (((1)))+2, or ((1+2))+3. But no doubt you have your reasons. (Of course, you could easily ban the redundant double parentheses by replacing expression with top_level in the second production for factor.)

Second, it seems to me that the hoop-jumping involved in the LL(1) grammar in the third section is just one more reason to ask why there is any reason to use LL grammars. The LR(1) grammar is easier to read, and its correspondence with the language's syntactic structure is clearer. The logic of the generated recursive-descent parser may be easier to understand, but to me that seems secondary.

+1 for remark about using LL(1) when stronger parser generators are easily found. Why buy a needless headache? — Ira Baxter, Mar 20 '16 at 05:00
There's another easy way to reject (....) based on the observation that any useful grammar for a language must accept at least that language, and may accept more (hard to be perfect). So what we typically do with a parser is "accept (slightly) too much, and reject the excess" (by using some machinery outside the parser). Rejecting the parse (.... ) is now easy: parse using a simple grammar that allows it, and reject any parse that has (...) at the top level. — Ira Baxter, Mar 20 '16 at 05:03
@IraBaxter: Yeah, I thought of mentioning that, but the formal grammar was easy enough and the code for a hand-built RDP would end up being pretty similar to what you'd end up with the reject strategy. (IMHO, you'd need a pretty good reason to ban outer parentheses in order to justify the time dedicated to answering this question :) ) — rici, Mar 20 '16 at 05:10
In your final grammar, `expression_or_term1` -> `expression1` -> `empty`, so `top_level` -> `'(' expression ')'`, which is not in the original grammar... — Chris Dodd, Mar 20 '16 at 20:01
@chrisdodd: good catch. That also makes for an ambiguity. I'll fix it, but it ain't gonna be pretty. — rici, Mar 20 '16 at 22:32
Thank you very much for this comprehensive answer! A few thoughts: **1**. Thanks for confirming my suspicion that "the grammar is not LL(k) for any k". **2**. Your recursive-descent parser seems equivalent to handling an initial `(` using `expression : LPAREN expression RPAREN term1 expression1`, with a final assertion that `term1` and `expression1` are not both empty. This does seem very similar to Ira Baxter's suggestion to parse the entire input as an `expression`, with a final assertion that there are no outer parentheses. Both approaches "accept slightly too much and reject the excess". — user200783, Mar 22 '16 at 09:40
**3**. As I try to decide between the LALR(1) and recursive-descent approaches to parsing, clearly the LR(1) grammar is preferable to the LL(1) grammar. I assume, from your 2nd observation, that you feel this simplicity of grammar outweighs the simplicity of the top-down parser itself? That is, you prefer simple grammar/complex parser LALR(1) over complex grammar/simple parser LL(1)? There's also the "accept slightly too much" approach: a top-down parser for `expression` modified to reject outer parens. Would you also prefer LALR(1) over this (potentially simple grammar/simple parser) approach? — user200783, Mar 22 '16 at 09:40
@paul: yes, there is little difference between my approach and Ira's. The parse-and-check strategy is often a good pragmatic approach, regardless of underlying parsing technology. — rici, Mar 22 '16 at 13:38
@paul: since the lalr parser is generated for me, its simplicity or complexity is irrelevant. (I don't choose compilers on the basis of the understandability of their generated code, either.) Independent of your particular issue, I find the LL grammar inferior. `f: t | t '+' f` captures the syntax and semantics of the arithmetic syntax; `f: t f'; f': | '+' f'` does not. — rici, Mar 22 '16 at 13:45
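
The "accept slightly too much, and reject the excess" idea from the comments above could look roughly like this in Python. This is a hedged sketch, not code from any commenter: the function name `parse_no_outer_parens`, the `"paren"` marker node and the `"EOF"` sentinel are assumptions made for the illustration. It parses the whole input with the plain expression grammar and only afterwards rejects a tree whose root is a parenthesized group.

```python
import re


def parse_no_outer_parens(text):
    """Parse with the plain expression grammar, then reject (...) at the root."""
    # Single-character tokens and digit runs; "EOF" can never come from the regex.
    tokens = re.findall(r"[0-9]+|\S", text) + ["EOF"]
    pos = 0

    def peek():
        return tokens[pos]

    def take(expected=None):
        nonlocal pos
        token = tokens[pos]
        if expected is not None and token != expected:
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        pos += 1
        return token

    def expression():
        left = term()
        while peek() in ("+", "-"):
            left = (take(), left, term())
        return left

    def term():
        left = factor()
        while peek() in ("*", "/"):
            left = (take(), left, factor())
        return left

    def factor():
        token = take()
        if token[0] in "0123456789":
            return int(token)
        if token == "(":
            inner = expression()
            take(")")
            return ("paren", inner)      # remember that parentheses were here
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expression()
    take("EOF")
    # The check that lives outside the grammar: no parentheses around everything.
    if isinstance(tree, tuple) and tree[0] == "paren":
        raise SyntaxError("outer parentheses are not allowed")
    return tree


print(parse_no_outer_parens("(1+2)*3"))   # ('*', ('paren', ('+', 1, 2)), 3)
```

The grammar stays the ordinary one for arithmetic expressions; the restriction is a single check on the finished tree.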

Chris Dodd · Answer 2 · 2016-03-20T18:37:55.503

To make the grammar LL(1) you need to finish left-factoring top_level. You stopped at:

top_level   : expression top_level1
            | term top_level2
            | NUMBER

expression and term both have NUMBER in their FIRST sets, so they must first be substituted to left-factor:

top_level   : NUMBER term1 expression1 top_level1
            | NUMBER term1 top_level2
            | NUMBER
            | LPAREN expression RPAREN term1 expression1 top_level1
            | LPAREN expression RPAREN term1 top_level2

which you can then left-factor to

top_level   : NUMBER term1 top_level3
            | LPAREN expression RPAREN term1 top_level4

top_level3  : expression1 top_level1
            | top_level2
            | empty

top_level4  : expression1 top_level1
            | top_level2

Note that this still is not LL(1), as there are epsilon rules (term1, expression1) with overlapping FIRST and FOLLOW sets. So you need to factor those out too to make it LL(1).

Thanks. It looks like, if the left-factoring is done slightly differently (so that both `top_level3` and `top_level4` begin with `term1`), further transformations will lead to the LL(1) grammar given by rici (`top_level3` becomes `optional_expression_or_term_1` and `top_level4` becomes `expression_or_term_1`). — user200783, Mar 22 '16 at 09:40
