1

I have the following grammar:

IdentifierName ::
    IdentifierStart
    IdentifierName IdentifierPart

For the word "git", this should produce the following parse tree:

                 IdentifierName
                /              \
        IdentifierName IdentifierPart
       /              \         |
IdentifierName IdentifierPart  't'
       |                |
IdentifierStart        'i'
       |
      'g'

I want to write a recursive descent algorithm to do that. I have two options: write a recursive descent parser with backtracking, or write a predictive recursive descent parser. Neither is a table-driven parser. However, I've read that recursive descent with backtracking requires eliminating left recursion, and the grammar above appears to be left-recursive.

So am I right that I need to either refactor the grammar or use a predictive algorithm?

Max Koretskyi

2 Answers

3

Yes, the grammar is left-recursive and thus not LL. Neither backtracking nor predictive LL-parsers can handle such a grammar. So you'd either need to change the grammar or use another algorithm such as an LR-parsing algorithm.
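
As a rough illustration of "changing the grammar", the rule can be read as `IdentifierName :: IdentifierStart IdentifierPart*` and parsed with a loop, while still building the left-nested tree from the question. This is only a sketch: the class `IdentifierNameParser`, its `Node` type and the ASCII-only character classes are assumptions made for the example, not part of any real parser.

```java
import java.util.List;

final class IdentifierNameParser {
    /** A parse-tree node: a rule name (or a quoted character) plus its children. */
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = List.of(children);
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return label;
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) {
                sb.append(' ').append(child);
            }
            return sb.append(')').toString();
        }
    }

    // Simplified ASCII-only character classes (an assumption for this sketch).
    static boolean isIdentifierStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '$';
    }

    static boolean isIdentifierPart(char c) {
        return isIdentifierStart(c) || (c >= '0' && c <= '9');
    }

    static Node leaf(char c) {
        return new Node("'" + c + "'");
    }

    /**
     * Parses IdentifierStart IdentifierPart* with a loop. Each part is folded
     * into the tree on the left, so the result has the shape the
     * left-recursive rule describes, without any left-recursive call.
     */
    static Node parse(String input) {
        if (input.isEmpty() || !isIdentifierStart(input.charAt(0))) {
            throw new IllegalArgumentException("expected IdentifierStart at position 0");
        }
        Node tree = new Node("IdentifierName",
                new Node("IdentifierStart", leaf(input.charAt(0))));
        for (int i = 1; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!isIdentifierPart(c)) {
                throw new IllegalArgumentException("unexpected character at position " + i);
            }
            tree = new Node("IdentifierName", tree, new Node("IdentifierPart", leaf(c)));
        }
        return tree;
    }
}
```

With these assumptions, `IdentifierNameParser.parse("git")` prints as `(IdentifierName (IdentifierName (IdentifierName (IdentifierStart 'g')) (IdentifierPart 'i')) (IdentifierPart 't'))`, which is the tree from the question written in nested form.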

Note that this grammar is regular, so it can actually be translated to a regular expression or directly into a finite automaton.
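
To make the "regular expression" remark concrete, here is a minimal sketch using `java.util.regex`. The character classes are an ASCII-only approximation chosen for the example; the real IdentifierStart and IdentifierPart classes also admit Unicode letters and escape sequences, which are left out here. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class IdentifierRegex {
    // ASCII-only approximation of IdentifierStart IdentifierPart*
    // (assumption: Unicode letters and escapes are omitted in this sketch).
    static final Pattern IDENTIFIER_NAME =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*");

    /** Returns true if the whole string matches the identifier pattern. */
    static boolean isIdentifierName(String s) {
        return IDENTIFIER_NAME.matcher(s).matches();
    }
}
```

The pattern has the same structure as the grammar: one start character followed by zero or more part characters.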

When writing a real-world implementation of JavaScript, lexical rules such as this one would be handled in the lexer, and only the other rules would be handled by the parser (however, those are specified left-recursively as well, so they would also have to be rewritten to be parsed by an LL-parser).
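
For the parser-level rules, one common way to rewrite left-recursive productions for left-associative infix operators (for example `AdditiveExpression`) is a loop in the style of precedence climbing. The sketch below is not taken from any real implementation; the class `BinaryExpressionParser`, the four operators and the integer-only operands are assumptions made to keep it short. It returns a fully parenthesized string so the grouping is visible.

```java
final class BinaryExpressionParser {
    private final String src;
    private int pos = 0;

    private BinaryExpressionParser(String src) {
        this.src = src.replace(" ", ""); // the sketch simply ignores spaces
    }

    /** Parses the whole input and returns a fully parenthesized form. */
    static String parse(String src) {
        BinaryExpressionParser p = new BinaryExpressionParser(src);
        String tree = p.parseBinary(1);
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return tree;
    }

    // Binding strength of an operator; -1 means "not a binary operator".
    private static int precedence(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': return 2;
            default: return -1;
        }
    }

    // Precedence climbing: parse one operand, then keep absorbing operators
    // whose precedence is at least minPrec. The loop yields left-associative
    // grouping without a left-recursive call.
    private String parseBinary(int minPrec) {
        String left = parseOperand();
        while (pos < src.length()) {
            char op = src.charAt(pos);
            int prec = precedence(op);
            if (prec < minPrec) {
                break;
            }
            pos++;
            // The right operand may only contain tighter-binding operators.
            String right = parseBinary(prec + 1);
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    // Operands are unsigned integers in this sketch.
    private String parseOperand() {
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("expected a number at position " + pos);
        }
        return src.substring(start, pos);
    }
}
```

Under these assumptions, `BinaryExpressionParser.parse("1 - 2 - 3")` gives `((1 - 2) - 3)` and `BinaryExpressionParser.parse("1 + 2 * 3")` gives `(1 + (2 * 3))`.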

sepp2k
  • thanks, _this grammar is actually regular_ - how do you know that? _would be handled in the lexer_ - yes, this is how TS compiler implements it. _however those are specified left-recursively as well, so they'd also have to be rewritten to be parsed by an LL-parser_ - actually TS compiler [is implemented as recursive descent](https://github.com/Microsoft/TypeScript/issues/17824#issuecomment-325738135) and [from what I can tell](https://github.com/Microsoft/TypeScript/blob/94b4f8b79e370020cb31995e8fb0b78f9ba94349/src/compiler/parser.ts#L516) it is `predictive` so how did they do it? – Max Koretskyi Oct 13 '17 at 20:02
  • @AngularInDepth.com How I know the grammar is regular: A regular grammar is one where each production contains at most one non-terminal and that non-terminal is either always the left-most part of the production or always the right-most part. That's the case here. How the TS parser is implemented: They use while-loops rather than recursion to parse lists (which use left-recursion in the spec's grammar). In extended BNF notation that'd be equivalent to using the repetition operator, so you could say they rewrote `FooList :: FooList ',' Foo` to `FooList :: Foo ("," Foo)*` and then implemented that. – sepp2k Oct 13 '17 at 20:32
  • PS: If you look at the `tryParse` function, that implements backtracking, so the parser does backtrack. – sepp2k Oct 13 '17 at 20:48
  • appreciate your elaborate answers! so to conclude they indeed use top down recursive parser, however with backtracking, not predictive. Also, for some parts of the grammar that use left-recursion, like lists, they use `while-loops` rather than recursion. Is it correct? Also, can you maybe provide a reference to the `lists` grammar you're talking about? [This one](https://www.ecma-international.org/ecma-262/8.0/index.html#prod-StatementList)? – Max Koretskyi Oct 14 '17 at 09:31
  • @AngularInDepth.com Yes, that's correct. I wasn't talking about a specific grammar, but basically every one that has "list" in its name, including StatementList (though my FooList example used comma-separated lists, so things like ElementList, ArgumentList or FormalParameterList would be a closer match). Note that in the TS-parser all lists without separators use the `parseList` function and all lists with separators use the `parseDelimitedList` function, so it's all in one place (well, two places). – sepp2k Oct 14 '17 at 12:00
  • Another place with left-recursion is left-associative infix operators (AdditiveExpression, RelationalExpression etc.). Those are handled by the `parseBinaryExpressionOrHigher` function, which also uses while-loops (or rather the `parseBinaryExpressionRest` function that it calls does). PS: When looking through the grammar in the spec, you might want to go into appendix A, which has the whole grammar in one spot rather than looking at it in bits and pieces in the main document. – sepp2k Oct 14 '17 at 12:03
  • Much obliged for your elaborate answers and comments. That's so rare now on SO. Regarding parsing the arithmetic expressions I thought that TS uses either `precedence climbing` or `the shunting yard algorithm` described [here](http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm). Is it the case? Or is it some other approach/algorithm that they use for `while-loops` I can read about? PS. Do you have an account on twitter? – Max Koretskyi Oct 14 '17 at 13:42
  • @AngularInDepth.com At a quick glance it looks like it's using precedence climbing in `parseBinaryExpressionRest`. No, I don't have twitter. – sepp2k Oct 14 '17 at 14:07
  • Got it, thanks, I'll investigate further. Good luck! – Max Koretskyi Oct 14 '17 at 14:15
0

This tends to be the work of the lexer, not the parser. Normally, a lexer proceeds one character at a time, in a loop, with a big switch statement (or an equivalent "initial character table" if it is data-driven).

// near the end of the big "switch (ch) {" statement ...
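default:
    // any other initial character must at least be able to start an identifier
    if (!isIdentifierStart(chInit))
        {
        log(Severity.ERROR, ILLEGAL_CHAR, new Object[]{quotedChar(chInit)},
                lInitPos, source.getPosition());
        }
    // fall through: after logging, scanning continues as for an identifier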
case 'A':case 'B':case 'C':case 'D':case 'E':case 'F':case 'G':
case 'H':case 'I':case 'J':case 'K':case 'L':case 'M':case 'N':
case 'O':case 'P':case 'Q':case 'R':case 'S':case 'T':case 'U':
case 'V':case 'W':case 'X':case 'Y':case 'Z':
case 'a':case 'b':case 'c':case 'd':case 'e':case 'f':case 'g':
case 'h':case 'i':case 'j':case 'k':case 'l':case 'm':case 'n':
case 'o':case 'p':case 'q':case 'r':case 's':case 't':case 'u':
case 'v':case 'w':case 'x':case 'y':case 'z':
case '_':
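    {
    // consume characters for as long as they can continue an identifier
    while (source.hasNext())
        {
        if (!isIdentifierPart(nextChar()))
            {
            // one character too many was read: step back and stop
            source.rewind();
            break;
            }
        }
    // the identifier text runs from the initial position to the current one
    String name = source.toString(lInitPos, source.getPosition());
    // ...
    }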

If building by hand, I find it far easier to have a dedicated lexer (producing tokens from a stream of chars) and parser (producing an AST from a stream of tokens) than to try to combine those into one parser.
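
A minimal, self-contained sketch of that split might look like the following. It is not the lexer the fragment above was taken from: `MiniLexer`, `Kind` and `Token` are hypothetical names, and Java's own `Character.isJavaIdentifierStart` / `Character.isJavaIdentifierPart` stand in for whatever identifier rules the language being lexed defines.

```java
import java.util.ArrayList;
import java.util.List;

final class MiniLexer {
    enum Kind { IDENTIFIER, NUMBER, PUNCTUATOR }

    /** A token: its kind plus the exact source text it covers. */
    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + ":" + text;
        }
    }

    /** Turns a stream of chars into a list of tokens; whitespace is skipped. */
    static List<Token> tokenize(String source) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < source.length()) {
            char ch = source.charAt(pos);
            int start = pos;
            if (Character.isWhitespace(ch)) {
                pos++;
            } else if (Character.isJavaIdentifierStart(ch)) {
                // One start character, then as many part characters as follow.
                do {
                    pos++;
                } while (pos < source.length()
                        && Character.isJavaIdentifierPart(source.charAt(pos)));
                tokens.add(new Token(Kind.IDENTIFIER, source.substring(start, pos)));
            } else if (Character.isDigit(ch)) {
                while (pos < source.length() && Character.isDigit(source.charAt(pos))) {
                    pos++;
                }
                tokens.add(new Token(Kind.NUMBER, source.substring(start, pos)));
            } else {
                // Anything else is treated as a one-character punctuator.
                pos++;
                tokens.add(new Token(Kind.PUNCTUATOR, String.valueOf(ch)));
            }
        }
        return tokens;
    }
}
```

A parser written against `List<Token>` then never looks at individual characters, so identifier scanning stays in one place. With this sketch, `MiniLexer.tokenize("git = x1 + 42")` prints as `[IDENTIFIER:git, PUNCTUATOR:=, IDENTIFIER:x1, PUNCTUATOR:+, NUMBER:42]`.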

cpurdy