0

In the appendices of the Dragon-book, a LL(1) front end was given as a example. I think it is very helpful. However, I find out that for the context free grammar below, a at least LL(2) parser was needed instead.

statement : variable ':=' expression
          | functionCall

functionCall : ID'(' (expression ( ',' expression )*)? ')'
             ;
variable : ID
         | ID'.'variable
         | ID '[' expression ']'
         ;

How could I adapt the lexer for LL(1) parser to support k look ahead tokens? Are there some elegant ways?

I know I can add some buffers for tokens. I'd like to discuss some details of programming.


this is the Parser:

class Parser
{
    private Lexer lex;
    private Token look;
    public Parser(Lexer l)
    {
        lex = l;
        move();
    }

    private void move()
    {
        look = lex.scan();
    }
}

and the Lexer.scan() returns the next token from the stream.

zsf222
  • 345
  • 1
  • 10
  • I think (if I can recall my university notes correctly) that there is a theorem that tells you that LL(k) can always be reduced to LL(1) for all k. Exploiting that theorem would be an elegant way. – Bathsheba May 14 '14 at 07:22
  • I'd be happy to help but please show some code :-) – cadrian May 14 '14 at 07:22
  • @Bathsheba really? Could you please give some details? – zsf222 May 14 '14 at 07:26
  • @cadrian I have added some codes :) – zsf222 May 14 '14 at 07:29
  • @Bathsheba: You're thinking of the reduction of LR(k) grammars to LR(1). It's not possible for LL(k). – rici May 14 '14 at 14:28
  • Based on the title of the question, I am voting to close this question. The reason is that it is off-topic on SO to ask for any kind of favorite off-site resource or recommendation for a third party tool. – Michael J. Gray May 14 '14 at 16:55
  • The lexer is your smallest problem. Just read `k` tokens instead of 1. The real problem is creating the LL(k) parse table. Also, your particular grammar itself can be very easily converted to LL(1). – Shahbaz May 14 '14 at 17:11

1 Answers1

1

In effect, you need to buffer k lookahead tokens in order to do LL(k) parsing. If k is 2, then you just need to extend your current method, which buffers one token in look, using another private member look2 or some such. For larger k, you could use a ring buffer.

In practice, you don't need the full lookahead all the time. Most of the time, one-token lookahead is sufficient. You should structure the code as a decision tree, where future tokens are only consulted if necessary to resolve ambiguity. (It's often useful to provide a special token type, "unknown", which can be assigned to the buffered token list to indicate that the lookahead hasn't reached that point yet. Alternatively, you can just always maintain k tokens of lookahead; for handbuilt parsers, that can be simpler.)

Alternatively, you can use a fallback structure where you simply try one alternative and if that doesn't work, instead of reporting a syntax error, restore the state of the parser and lexer to the next alternative. In this model, the lexer takes as an explicit argument the current input buffer position, and the input buffer needs to be rewindable. However, you can use a lookahead buffer to effectively memoize the lexer function, which can avoid rewinding and rescanning. (Scanning is usually fast enough that occasional rescans don't matter, so you might want to put off adding code complexity until your profiling indicates that it would be useful.)

Two notes:

1) I'm skeptical about the rule:

functionCall : ID'(' (expression ( ',' expression )*)* ')'
             ;

That would allow, for example:

function(a[3], b[2] c[x] d[y], e.foo)

which doesn't look right to me. Normally, you'd mark the contents of the () as optional instead of repeatable, eg. using an optional marker ? instead of the second Kleene star *:

functionCall : ID'(' (expression ( ',' expression )*)? ')'
             ;

2) In my opinion, you really should consider using bottom-up parsing for an expression language, either a generated LR(1) parser or a hand-built Pratt parser. LL(1) is rarely adequate. Of course, if you're using a parser generator, you can use tools like ANTLR which effectively implement LL(∞); that will take care of the lookahead for you.

rici
  • 234,347
  • 28
  • 237
  • 341
  • And thank you for introducing me the Pratt parser. However, the example program in Dragon book which uses LL(1) parses a very typical expression language. So I think LL(1) is also OK in some cases. – zsf222 May 14 '14 at 16:51
  • @zsf222: Yes, it's ok in some cases. but as your question demonstrates, it has very definite limits. An LR(1) parser generator would have no problem whatever with your grammar. Of course, you can rearrange things so that it is top-down parseable but at some point it becomes easier to just go with a parser generator. IMHO. – rici May 14 '14 at 18:30