
I'm trying to write a Bison C++ parser for parsing JavaScript files, but I can't figure out how to make the semicolon optional.

According to the ECMAScript 2018 specification (https://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf, section 11.9), the semicolon isn't actually optional; instead, it is inserted automatically during parsing. The specification states:

When, as the source text is parsed from left to right, a token (called the offending token) is encountered that is not allowed by any production of the grammar, then a semicolon is automatically inserted before the offending token if one or more of the following conditions is true:

  • The offending token is separated from the previous token by at least one LineTerminator[...]

Based on this, I'm trying to solve the problem in the following naive way:

  • Detect the error, using the error special token;
  • In the action, tell the lexer that a syntax error occurred. If the lexer encountered a newline character before the current token, it returns a new semicolon token on the next yylex call; on the call after that, it returns the token that was the offending one when the syntax error occurred.

A very simplified version of my grammar looks like this:

program:
   stmt_list END
;

stmt_list:
    %empty
 |  stmt_list stmt
 |  stmt_list error  { /* error detected; tell the lexer about the syntax error */ }
;

stmt:
    value SEMICOLON
|   [other types of statements...]
;

value:
    NUMBER
|   STRING
;

With this approach, however, when the file contains a valid JavaScript statement that ends with a newline character instead of a semicolon, the parser reacts to the offending token by replacing the rest of the statement with the special error token. By the time I tell the lexer about the syntax error, the parser has already reduced the error token into stmt_list and the preceding valid statement is lost, which makes the semicolon insertion useless.

Obviously I don't want to let my parser discard the valid statement and go to the next one.

How can I make this possible? Is this the right approach or am I missing something?

Marco

1 Answer


I don't believe this approach is workable.

Just as a note, you would have to detect the error before any reduction takes place. So for semicolon insertion at the end of a statement, you need to add the error production to stmt, not stmt_list. So you would end up with something like this:

stmt_list
     :  %empty
     |  stmt_list stmt

stmt: value ';'   { handle_value_stmt(); }
    | value error { handle_value_stmt(); }
    | [other types of statements...]

That doesn't insert a semicolon; it just pretends that the semicolon was inserted. (If a semicolon couldn't be inserted, then another error will be triggered.)

But since it doesn't involve the lexer, it will happen whether or not the missing semicolon was at the end of a line, which is too enthusiastic. So the ideal solution would be to somehow tell the lexer to generate a semicolon token as the next token. But at the point where the error is detected, the lexer has already produced the lookahead token, and the parser knows what the lookahead token is. And it will use its recorded lookahead token to continue the parse.

There's also the question of how it is possible to communicate with the lexer at this point, since Mid-Rule Actions don't really play well with the error recovery algorithm. In theory, you could use the fact that yyerror will be called to report the error, but that means that yyerror needs to be able to deduce that this is not a "real" error, which means it will have to go poking into yyparse's guts. (I'm sure this is possible but I don't know how to do it off the top of my head, and it doesn't seem to me to be recommendable.)

Now, in theory it is possible to tell the parser to discard the lookahead token, and to tell the lexer to generate a semicolon followed by a repeat of the token it just sent. So it is just barely possible that by piling hack onto hack, you could make this work, if you're stubborn enough. But you'd end up with something very difficult to maintain, verify and test. (And making sure that it works in all corner cases will also be a challenge.)
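
To make the lexer half of that hack concrete, here is a minimal C++ sketch, under stated assumptions. The `Lexeme` record, the `ReplayLexer` class and its `request_semicolon` method are all hypothetical names invented for this illustration; none of them is part of Bison. The sketch only covers queuing a semicolon followed by a repeat of the rejected token. The parser side (discarding the recorded lookahead and annulling the error) is not shown, and that is where the real difficulty lies.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical token record. `newline_before` is assumed to be set by the
// raw lexer when a line terminator precedes the token.
struct Lexeme {
    std::string kind;
    bool newline_before;
};

// Wraps a raw token source and lets the caller rewrite the upcoming stream.
class ReplayLexer {
public:
    explicit ReplayLexer(std::function<Lexeme()> source)
        : source_(std::move(source)) {
        assert(source_);  // a callable token source is required
    }

    // Returns queued tokens first, then falls back to the raw source.
    Lexeme next() {
        if (!pending_.empty()) {
            Lexeme tok = pending_.front();
            pending_.pop_front();
            return tok;
        }
        return source_();
    }

    // To be called when the parser rejects `offending`. If a line break
    // preceded it, queue ";" followed by the same token and return true.
    // Otherwise return false: the caller has a real syntax error.
    bool request_semicolon(const Lexeme& offending) {
        if (!offending.newline_before) return false;
        pending_.push_back(Lexeme{";", false});
        // Replay the token with the flag cleared so that it cannot
        // trigger a second insertion at the same position.
        pending_.push_back(Lexeme{offending.kind, false});
        return true;
    }

private:
    std::function<Lexeme()> source_;
    std::deque<Lexeme> pending_;
};
```

A driver would call `next()` wherever it currently calls the lexer and `request_semicolon()` from whatever hook detects the error. Finding a safe hook is exactly the part that is hard to get right.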

And that's without looking at the other cases where semicolons could be inserted.

My approach to ASI was to simply analyse the grammar by figuring out which pairs of consecutive tokens are possible. (That's easy to do; you just need to construct FIRST and LAST sets, and then read through all the productions looking at consecutive symbols.) Then if the input consists of token A followed by one or more newlines followed by token B, and it is not possible for A to be followed by B in the grammar, then that's a candidate for semicolon insertion. The semicolon insertion might fail, but that will generate a syntax error, so you can't get a false positive. (You might have to fix the syntax error message, but at that point you at least know that you've inserted a semicolon.)
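
A minimal C++ sketch of such a filter, under some assumptions: the `Token` record with its `newline_before` flag, the `insert_semicolons` function and the precomputed `possible_pairs` set are hypothetical names chosen for this example, and the filter works on a whole token vector rather than sitting between yylex and the parser. The guard against producing two consecutive semicolons follows the rule mentioned in the comments below.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token record: `kind` is the terminal name, and
// `newline_before` is set by the lexer when at least one line terminator
// separates this token from the previous one.
struct Token {
    std::string kind;
    bool newline_before;
};

using TokenPair = std::pair<std::string, std::string>;

// Returns a copy of `input` with a ";" token inserted between A and B
// whenever B starts a new line and (A, B) is absent from `possible_pairs`,
// i.e. the grammar never allows A to be directly followed by B.
std::vector<Token> insert_semicolons(const std::vector<Token>& input,
                                     const std::set<TokenPair>& possible_pairs) {
    std::vector<Token> output;
    for (const Token& tok : input) {
        assert(!tok.kind.empty());  // token kinds are non-empty terminal names
        if (!output.empty() && tok.newline_before) {
            // Copy the previous kind: push_back below may reallocate.
            const std::string prev = output.back().kind;
            // Never create two consecutive semicolons.
            const bool touches_semicolon = (prev == ";" || tok.kind == ";");
            if (!touches_semicolon &&
                possible_pairs.count(TokenPair(prev, tok.kind)) == 0) {
                output.push_back(Token{";", false});
            }
        }
        output.push_back(tok);
    }
    return output;
}
```

The parser then sees an ordinary token stream and needs no error productions for this case. If the inserted semicolon turns out to be wrong, the parser reports a normal syntax error at that point.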

Proving that that algorithm works is trickier, because it could theoretically be the case that A could be followed by B in some context but that it is not possible in the current context, while A ; B would be possible in the current context. In that case, you might miss a possible semicolon insertion. I haven't looked in detail at recent JS versions, but long ago when I wrote a JS lexer, I managed to prove to my own satisfaction that there are no such cases.


Note: since the question was raised in a comment, I'll add a little hand-waving, although I really don't recommend following this approach.

Without diving into bison's guts, it's really not possible to "unshift" a token, including the error token (which is a real token, more or less). By the time the error token has been shifted, the parse is effectively committed to an error production. So if you want to annul the error, you have to accept that fact and work around it.

After an error token has been shifted, the parser will then skip tokens until a shiftable token is encountered. So if you've managed to insert an automatic semicolon into the token stream, you can use that token as a guard:

    stmt: value ';'       { handle_value_stmt(); }
        | value error ';' { handle_value_stmt(); }

However, you might not have managed to insert an automatic semi-colon, in which case you really need to report the syntax error (and maybe attempt to resynchronise). The above rules would just silently drop tokens up to the next semicolon, which is certainly wrong. So a first approximation would be for your ASI inserter to always insert something, which can be used as a guard in the error productions:

    stmt: value ';'       { handle_value_stmt(); }
        | value error ';' { handle_value_stmt(); }
        | value error NO_ASI { handle_real_error(); }

That's sufficient for "abort on error" processing, but if you want to do error recovery, you'll need to do some more hackery.

As I said, I really don't recommend going down this route. The end result won't be pretty, even if it works (and you still might find that code which you thought worked fails on real user input, in a case you didn't consider).

rici
  • Actually, I was able to instruct the lexer to go back in the file and return the previous token plus a faked semicolon if a newline character is present between the two, but the problem was unshifting the `error` token to recover the original statement. However, your approach is interesting. How do you generate the possible pairs of tokens, starting from the productions? I'm new to the world of parsers, so I'm sorry if some of my questions seem silly. – Marco Sep 27 '18 at 07:32
  • @Marco: It's not a silly question but it's a bit out of scope. The standard algorithm for computing nullability and FIRST sets is described in any parsing textbook, including online references (and a number of SO questions). There's a beautiful O(n) algorithm, but if you can't find it, the least-fixed-point algorithm will certainly be fast enough (it's worst-case O(N²) and N isn't huge here.) The LAST computation is similar; just work in the opposite direction. Once you have those three things, finding the possible pairs is just a simple scan over the productions... – rici Sep 27 '18 at 15:54
  • ... this being Javascript, you'll need to adjust by hand. First, ASI is not legal if it would result in two consecutive semi-colons, so you'll have to remove all such pairs from the list. Then, there are things like the newline-restrictions, which means that ASI is *always* possible after `return` at the end of a line (unless the first token on the next line is a semicolon); similarly, ASI is always possible before `++` and `--` at the beginning of a line. There are some others. – rici Sep 27 '18 at 15:56
  • Oh, and remember that ASI is not always at the end of a line. It also occurs before `}`. But that case could be handled easily enough in the grammar by making some semicolons optional. – rici Sep 27 '18 at 15:57
  • @Marco: I added a possible approach for handling the error token. But it's going to be ugly. – rici Sep 27 '18 at 16:09
  • I see the point in not recommending this approach. I'm trying to follow the suggested approach. Just a question: I'm not sure, but I think that the LAST set for a given production is the set of tokens that the given production can end with. So, why do I have to compute LAST sets if I need to get the possible pairs of tokens? A production could be made of three nonterminals and generating pairs from FIRST and LAST sets will exclude the second of the three nonterminals it is made of. – Marco Oct 01 '18 at 07:12
  • You exclude pairs which cannot appear. So you need to figure out the possible pairs of consecutive tokens. You find the possible pairs by looking at the right-hand side of each production and considering the possible consecutive symbols. (Two symbols are possibly consecutive if they are really consecutive or if they are separated only by nullable non-terminals.) To figure out the token pairs which correspond to a symbol pair, you need the LAST set of the first symbol and the FIRST set of the second symbol. – rici Oct 01 '18 at 07:52
  • At the end of all that, any pair which is not in the set of possible pairs is a candidate for ASI. – rici Oct 01 '18 at 07:54
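
The computation described in these comments can be sketched in C++ as follows. This is a plain least-fixed-point version, not the linear-time algorithm, and every name in it (`Production`, `GrammarSets`, `compute_sets`, `possible_pairs`) is made up for the illustration. It assumes that any symbol which never appears as a left-hand side is a terminal, and it does not apply the JavaScript-specific manual adjustments mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Symbol = std::string;
using SymbolSet = std::set<Symbol>;
using PairSet = std::set<std::pair<Symbol, Symbol>>;

struct Production {
    Symbol lhs;
    std::vector<Symbol> rhs;  // empty for an %empty production
};

struct GrammarSets {
    SymbolSet nullable;
    std::map<Symbol, SymbolSet> first;
    std::map<Symbol, SymbolSet> last;
};

// Adds src to dst; returns true if dst grew.
static bool merge(SymbolSet& dst, const SymbolSet& src) {
    if (&dst == &src) return false;  // merging a set into itself is a no-op
    const auto before = dst.size();
    dst.insert(src.begin(), src.end());
    return dst.size() != before;
}

// Least-fixed-point computation of nullable, FIRST and LAST.
GrammarSets compute_sets(const std::vector<Production>& grammar) {
    GrammarSets s;
    for (const auto& p : grammar) { s.first[p.lhs]; s.last[p.lhs]; }
    for (const auto& p : grammar)
        for (const auto& x : p.rhs)
            if (s.first.count(x) == 0) {  // not a left-hand side: terminal
                s.first[x] = {x};
                s.last[x] = {x};
            }
    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& p : grammar) {
            bool all_nullable = true;
            for (const auto& x : p.rhs)
                if (s.nullable.count(x) == 0) { all_nullable = false; break; }
            if (all_nullable && s.nullable.insert(p.lhs).second) changed = true;
            // FIRST: scan from the left, stopping after a non-nullable symbol.
            for (auto it = p.rhs.begin(); it != p.rhs.end(); ++it) {
                if (merge(s.first[p.lhs], s.first[*it])) changed = true;
                if (s.nullable.count(*it) == 0) break;
            }
            // LAST: the same scan in the opposite direction.
            for (auto it = p.rhs.rbegin(); it != p.rhs.rend(); ++it) {
                if (merge(s.last[p.lhs], s.last[*it])) changed = true;
                if (s.nullable.count(*it) == 0) break;
            }
        }
    }
    return s;
}

// Token pairs (a, b) such that a can be directly followed by b.
PairSet possible_pairs(const std::vector<Production>& grammar,
                       const GrammarSets& s) {
    PairSet pairs;
    for (const auto& p : grammar)
        for (std::size_t i = 0; i < p.rhs.size(); ++i)
            for (std::size_t j = i + 1; j < p.rhs.size(); ++j) {
                assert(s.last.count(p.rhs[i]) && s.first.count(p.rhs[j]));
                for (const auto& a : s.last.at(p.rhs[i]))
                    for (const auto& b : s.first.at(p.rhs[j]))
                        pairs.insert(std::make_pair(a, b));
                // Only nullable symbols may sit between the two.
                if (s.nullable.count(p.rhs[j]) == 0) break;
            }
    return pairs;
}
```

Run on the simplified grammar from the question, this yields the pairs (NUMBER, SEMICOLON), (STRING, SEMICOLON), (SEMICOLON, NUMBER), (SEMICOLON, STRING) and (SEMICOLON, END). A pair such as (NUMBER, NUMBER) is absent, so it is a candidate for ASI when a newline separates the two tokens.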