
I'm trying to build a C compiler from scratch. Googling took me to https://craftinginterpreters.com/, which explains in detail how source code parsers work. I've reached the "Parsing Expressions" chapter and got stuck.
They present this algorithm for building a parse tree (recursive descent):

Expr expression() {
    Expr left = equality();
    ...
}

Expr equality() {
    Expr left = comparison();
    ...
}

Expr comparison() {
    Expr left = term();
    ...
}

Expr term() {
    Expr left = factor();
    ...
}

Expr factor() {
    Expr left = unary();
    ...
}

Expr unary() {
    if (WeFoundToken('-')) {
        return Expr.Unary(prev(), unary());
    }
    return primary(); // if there's no '-'
}

But what happens when we parse this simple expression: 1 * 2 - 3
I can't see how this works. When the top-level expression() is called, it falls straight through all the lower-precedence parsing functions down to unary(), the highest-precedence one. As I understand it, that is the unary() function which will iterate through tokens, find the - and return it to us as a unary negation expression (-3).
In that case, there is no way to build the valid tree, which should look like this:

   -
 *   3
1 2

Instead we would get this, which is invalid because it has no root node:

 *  
1 2  (-3)
IC_
  • Wouldn't the `term()` rule cover the case of ` - 3` before you reach `unary()`? – Mathias R. Jessen Apr 01 '21 at 18:23
  • @mathiasr.jessen As I understand it, `term()` calls `factor()`, which immediately calls `unary()` as its first operation without any extra logic. Then `unary()` finds the first `-` token, and so on. – IC_ Apr 01 '21 at 18:47

2 Answers


Let's step through the example expression 1 * 2 - 3. The parser starts at the beginning.

  • The calls descend from expression down to primary, which matches the first token, 1. Consume 1.
  • Control returns to factor, where expr is set to 1. The while condition then matches the *. Consume *.
    • Inside the while loop, factor parses a unary next, which matches 2 in primary. Consume 2.
    • expr is set to 1 * 2. There are no more * or / tokens to consume, so the loop terminates and this expression is returned to term.
  • term enters its while loop and sees -. Consume - (so the next token is now 3, not -).
    • term then parses a factor, which matches 3 in primary. Consume 3.
    • expr is set to (1 * 2) - 3.

This results in the tree:

    -
  *   3
 1 2

In other words, by the time the parser reaches the -, term's first call to factor has already returned with 1 * 2. term therefore enters its while loop and consumes the - itself as a binary operator. unary only ever sees a - when it is the first token of an operand, which is not the case here.
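The walk-through above can be checked with a small runnable sketch. This is not the book's code: MiniParser, tokenize and parse are hypothetical names invented for illustration, only term, factor, unary and primary are modelled, and the "tree" is returned as a fully parenthesized string rather than as Expr objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini-parser (not the book's Parser class): integer literals,
// binary + - * / and unary minus only. Results are parenthesized strings.
final class MiniParser {
    private final List<String> tokens;
    private int current = 0; // index of the next unconsumed token

    MiniParser(List<String> tokens) { this.tokens = tokens; }

    // A run of digits becomes one token; any other non-space character is its own token.
    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                out.add(src.substring(start, i));
            } else {
                out.add(String.valueOf(c));
                i++;
            }
        }
        return out;
    }

    static String parse(String src) {
        return new MiniParser(tokenize(src)).term();
    }

    // Looks at the current token only; consumes it if it equals one of ops.
    private boolean match(String... ops) {
        if (current >= tokens.size()) return false;
        for (String op : ops) {
            if (tokens.get(current).equals(op)) { current++; return true; }
        }
        return false;
    }

    private String previous() { return tokens.get(current - 1); }

    // term -> factor ( ("-" | "+") factor )*
    private String term() {
        String expr = factor();
        while (match("-", "+")) {
            String op = previous();
            String right = factor();
            expr = "(" + expr + " " + op + " " + right + ")";
        }
        return expr;
    }

    // factor -> unary ( ("*" | "/") unary )*
    private String factor() {
        String expr = unary();
        while (match("*", "/")) {
            String op = previous();
            String right = unary();
            expr = "(" + expr + " " + op + " " + right + ")";
        }
        return expr;
    }

    // unary -> "-" unary | primary
    private String unary() {
        if (match("-")) return "(-" + unary() + ")";
        return primary();
    }

    private String primary() {
        if (current >= tokens.size() || !Character.isDigit(tokens.get(current).charAt(0))) {
            throw new IllegalStateException("expected a number at token " + current);
        }
        return tokens.get(current++);
    }

    public static void main(String[] args) {
        System.out.println(parse("1 * 2 - 3")); // ((1 * 2) - 3)
        System.out.println(parse("1 - 2 * 3")); // (1 - (2 * 3))
        System.out.println(parse("1 * -2"));    // (1 * (-2))
    }
}
```

In the first two inputs the - is consumed by the while loop in term, so it becomes a binary operator. Only in the third input does unary see a - as the first token of an operand.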

General Grievance
  • Could you please explain the flow for `1 - 2 * 3`? – IC_ Apr 02 '21 at 03:07
  • I understand it when the higher-precedence operation comes first in the expression. But when it comes second, I don't see how the same logic works, because the first token will become a "primary" anyway and won't take the `- 2` part into account. – IC_ Apr 02 '21 at 06:57
  • @Herrgott Sure. Starting with the `1` (unary) in `1 - 2 * 3`, control returns to `term` because the next token is a `-`. The `while` loop then tries to parse a `factor`, which eventually returns `2 * 3`. – General Grievance Apr 02 '21 at 11:51

> unary() function which will iterate through tokens

No, it doesn't. It looks only at the next input token and sees that it is not a -. So it returns primary().

The actual code from the page you linked in your question is:

  private Expr unary() {
    if (match(BANG, MINUS)) {
      Token operator = previous();
      Expr right = unary();
      return new Expr.Unary(operator, right);
    }

    return primary();
  }

That function calls match, not WeFoundToken. (I'm not sure where that comes from). match is defined earlier on the page:

  private boolean match(TokenType... types) {
    for (TokenType type : types) {
      if (check(type)) {
        advance();
        return true;
      }
    }

    return false;
  }

The loop in that function (for (TokenType type : types)) loops over the arguments to the call, comparing each one in turn with the next input token. It never looks at any other input token.
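To make that concrete, here is a simplified sketch of a token cursor. TokenCursor, its nested TokenType enum and position() are hypothetical names made up for this example, and check and advance are reduced versions written for illustration rather than copied from the book. The point is only that match inspects the token at the current position and moves at most one step.

```java
import java.util.List;

// Hypothetical, simplified token cursor: tokens are represented by their type only.
final class TokenCursor {
    enum TokenType { NUMBER, STAR, MINUS, BANG, EOF }

    private final List<TokenType> tokens; // assumed to end with EOF
    private int current = 0;

    TokenCursor(List<TokenType> tokens) { this.tokens = tokens; }

    // Exposed only so the example can observe where the cursor is.
    int position() { return current; }

    // True if the token at the current position has the given type.
    private boolean check(TokenType type) {
        TokenType next = tokens.get(current);
        return next != TokenType.EOF && next == type;
    }

    // Move past the current token, but never past EOF.
    private void advance() {
        if (tokens.get(current) != TokenType.EOF) current++;
    }

    // The loop runs over the candidate types, not over the input:
    // every iteration compares against the same current token.
    boolean match(TokenType... types) {
        for (TokenType type : types) {
            if (check(type)) {
                advance();
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Token types for the input 1 * 2 - 3
        TokenCursor cursor = new TokenCursor(java.util.Arrays.asList(
                TokenType.NUMBER, TokenType.STAR, TokenType.NUMBER,
                TokenType.MINUS, TokenType.NUMBER, TokenType.EOF));
        System.out.println(cursor.match(TokenType.BANG, TokenType.MINUS)); // false
        System.out.println(cursor.position());                             // 0
    }
}
```

With the token types for 1 * 2 - 3, the call match(BANG, MINUS) at the start returns false and leaves the cursor at position 0, even though a MINUS appears later in the input.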

rici