Bison get next possible tokens or determine which rule is being attempted

Question

I'm using bison to parse a language from a spec I have no control over. Its definition contains a recursive rule, and because the language uses indentation this resulted in reduce/reduce conflicts. To get rid of the reduce/reduce conflicts I added a semi colon. I already have a layout tracker which automatically inserts semi colons and braces. I was thinking of extending its rules for inserting semi colons to cover the one I added that's not in the spec, but I can't think of a way of knowing when we're at the end of the recursive rule.

Is there a reliable way of knowing when I'm at the end of the recursive rule, or any suggestions on a different approach? Or, as messy as it would be, is there some way to get two-way communication between the parser and lexer?

I'm currently using a pull parser. I thought a push parser would let me keep better track of where I am in the lexer, but when I try the %define directive to generate a push parser the option isn't recognised. I'm using bison 3.0.4 with a custom lexer, generating a pure parser with the C++ API.

EDIT:

exp                     : infixexp  TYPE_SEP  type                                                      {}                     
                        | infixexp  TYPE_SEP  context RIGHT_ARROW type                                  {}
                        | infixexp                                                                      {}

infixexp                : lexp qop infixexp                                                             {}    
                        | MINUS infixexp                                                                {}      
                        | lexp                                                                          {}

lexp                    : BACKSLASH apat_list FUNC_TYPE_CONS exp                                        {}   
                        | KW_LET decls KW_IN exp                                                        {}    
                        | KW_IF exp SEMI_COLON KW_THEN exp SEMI_COLON KW_ELSE exp                       {}    //conditional
                        | KW_CASE exp KW_OF L_BRACE alts R_BRACE                                        {}   
                        | KW_DO L_BRACE stmts R_BRACE                                                   {}   
                        | fexp SEMI_COLON                                                               {}

fexp                    : aexp                                                                          {}   
                        | fexp  aexp                                                                    {}  

literal                 : INTEGER                                                                       {}
                        | FLOAT                                                                         {}
                        | CHAR                                                                          {}
                        | STRING                                                                        {}

aexp                    : qvar                                                                          {}  
                        | gcon                                                                          {}   
                        | literal
                        | L_PAREN exp R_PAREN                                                           {}   
                        | L_PAREN exp_list R_PAREN                                                      {}   
                        | L_BRACKET exp R_BRACKET                                                       {}  
                        | L_BRACKET exp_list R_BRACKET                                                  {}  
                        | L_BRACKET exp DOTDOT R_BRACKET                                                {}  
                        | L_BRACKET exp DOTDOT exp R_BRACKET                                            {}  
                        | L_BRACKET exp COMMA exp DOTDOT exp R_BRACKET                                  {}  
                        | L_BRACKET exp PIPE qual_list R_BRACKET                                        {}  
                        | L_PAREN infixexp qop  R_PAREN                                                 {}  
                        | L_PAREN qop infixexp R_PAREN                                                  {}   
                        | qcon  L_BRACE fbind_list R_BRACE                                              {}   
                        | aexp  L_BRACE fbind_list R_BRACE                                              {}

apat                    : qvar                                                                          {}
                        | qvar AT_SYM  apat                                                             {}  
                        | gcon                                                                          {}  
                        | qcon  L_BRACE fpat_list R_BRACE                                               {}  
                        | literal                                                                       {}
                        | WILDCARD                                                                      {}
                        | L_PAREN pat R_PAREN                                                           {}  
                        | L_PAREN pat COMMA pat_list R_PAREN                                            {}  
                        | L_BRACKET pat_list  R_BRACKET                                                 {} 
                        | TILDE apat                                                                    {}

Added a section from the grammar. It's basically a modified version of the Haskell 2010 spec. The reduce/reduce conflicts were resolved by adding the semi colon after fexp in the definition of lexp.

I am simulating indents/dedents and inserting opening and closing curly braces. I was basically thinking of the lexer hack but couldn't figure out how to do it with Bison. There are multiple recursive rules, but only this one was causing reduce/reduce conflicts.

EDIT 2:

jsfiddle with the original reduce/reduce errors

I'm not at all sure I understand what your problems are. I don't understand where you added the semicolon and how that resolved some problems. A single recursive rule sounds like an unusual language; most languages end up with many recursive rules. You can have the parser provide feedback to the lexical analyzer; indeed, C, where `typedef` changes an identifier from being a mere identifier into an alias for a type name, is a classic example. That tends to be a fairly limited 'two-way communication' though. Maybe you should show us some code that represents your problem. — Jonathan Leffler, Dec 31 '15 at 03:52
Push parsers are only available with the C api. (https://www.gnu.org/software/bison/manual/html_node/_0025define-Summary.html#index-_0025define-api_002epush_002dpull-1 : "Language(s): C"). To parse indentation-aware languages, you usually need to synthesize INDENT and DEDENT tokens, which definitely is easier with a push interface, but it can be done with the pull interface and a small queue. — rici, Dec 31 '15 at 04:17
I don't think the so-called lexer hack is relevant here. Whether or not a line ending is necessary at that point is not semantic, so sharing semantic information like the symbol table between lexer and parser (though it is easy enough to do) is not going to help. I've never tried a haskell parser -- perhaps I should -- but it sounds somewhat similar to the javascript algorithm, which can be solved by constructing a dictionary of token pairs; if two consecutive tokens are separated by a newline, the pair is looked up in the dictionary and a semicolon inserted if the pair is found... — rici, Dec 31 '15 at 04:49
... if that helps, I can add it to the answer. (The dictionary of pairs is constructed at parser build time by analysis of FIRST and FOLLOW sets, with a small amount of manual intervention to cope with some other javascript rules.) — rici, Dec 31 '15 at 04:50

score 2 · Accepted Answer · edited May 23 '17 at 10:28

The usual way to handle indentation-aware languages is to fabricate INDENT and DEDENT tokens in the lexical scanner. That's easier with a push interface, so it's unfortunate that you are using the bison C++ API which does not implement that feature.

But it can also be done without too much trouble using a shim in between the lexical scanner and the parser. You can see an example of a Python shim in this answer; ply doesn't offer a push parser interface either, so the shim keeps a small persistent queue of tokens which will be sent to the parser and checks that queue before asking the real lexical scanner for the next token.

As that answer indicates, in most layout-aware languages not all newlines are actually semantically significant. For example, in Python itself a newline inside parentheses, braces or brackets is just ordinary whitespace. That rule can easily be implemented by the shim as well (although I didn't complicate the code by doing so in the linked answer), simply by tracking the level of parentheses.
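A minimal C++ sketch of such a shim, under stated assumptions: the `Tok`, `Token` and `LayoutShim` names are hypothetical, the raw scanner is modelled as a plain callable that reports each token's column, and only one bracket kind is tracked. It is not a translation of the linked Python code and not part of the bison API. It only illustrates the two ideas above: a persistent queue that is drained before the real scanner is asked for more, and a nesting counter that turns newlines inside brackets into plain whitespace.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token type; a real bison C++ parser would use its own symbol type.
enum class Tok { Indent, Dedent, Newline, LParen, RParen, Word, End };

struct Token {
    Tok kind;
    int column;        // column of the token's first character
    std::string text;
};

// The parser's lexer callback would call next() instead of the raw scanner.
class LayoutShim {
public:
    explicit LayoutShim(std::function<Token()> scanner)
        : scan_(std::move(scanner)) {}

    // Hand out queued tokens first; only ask the scanner when the queue is empty.
    Token next() {
        while (pending_.empty()) refill();
        Token t = pending_.front();
        pending_.pop_front();
        return t;
    }

private:
    // Reads one raw token and queues it, plus any synthetic tokens it implies.
    void refill() {
        Token t = scan_();
        if (t.kind == Tok::Newline) {
            if (depth_ > 0) return;   // inside brackets: treat as whitespace
            at_line_start_ = true;
            pending_.push_back(t);
            return;
        }
        if (t.kind == Tok::End) {
            // Close every block that is still open at end of input.
            while (indents_.size() > 1) {
                indents_.pop_back();
                pending_.push_back(Token{Tok::Dedent, t.column, ""});
            }
            pending_.push_back(t);
            return;
        }
        if (at_line_start_) {
            at_line_start_ = false;
            if (t.column > indents_.back()) {
                indents_.push_back(t.column);
                pending_.push_back(Token{Tok::Indent, t.column, ""});
            }
            while (t.column < indents_.back()) {
                indents_.pop_back();
                pending_.push_back(Token{Tok::Dedent, t.column, ""});
            }
        }
        if (t.kind == Tok::LParen) ++depth_;
        if (t.kind == Tok::RParen && depth_ > 0) --depth_;
        pending_.push_back(t);
    }

    std::function<Token()> scan_;
    std::deque<Token> pending_;        // tokens waiting to be sent to the parser
    std::vector<int> indents_ = {0};   // stack of active indentation columns
    int depth_ = 0;                    // current bracket nesting level
    bool at_line_start_ = true;
};
```

With a pull parser, the only change on the parser side would be that its lexer callback forwards to `LayoutShim::next()`. A dedent to a column that matches no enclosing block is silently accepted here; a real implementation would probably report it as an error.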

Not all languages make life so easy; you may have a language in which indentation could reassert itself inside a bracketed list because of the presence of a function literal, for example. Or you may have a language like ecmascript where the semi-colon insertion rule allows run-on lines even outside of parentheses, if the alternative parse would not be possible. Haskell has a similar rule, where a brace can be inserted if an alternative parse would not be possible.

The ecmascript rule was drafted with a view to making it possible to write a parser. (Or, more accurately I think, the rule was drafted by reverse-engineering an existing parser, but I can't prove that.) As it turns out, it is possible to implement ecmascript automatic semi-colon insertion by constructing a dictionary of pairs of tokens which can be separated by a newline without a semicolon being inserted. (Or, alternatively, pairs of tokens which must have a semicolon inserted between them if possible, which is the inverse of the other set.) These sets can be constructed automatically by grammar analysis, using the FIRST and FOLLOW sets of each production. (The details of the ecmascript rules require a bit of adjustment because there are some token pairs which could appear in a valid program but which are not allowed to be separated by a newline. For example, `return 3` is a valid statement, but if the `return` is at the end of a line and the `3` is on the following line, a semicolon must be automatically inserted.) Bison does not do this analysis automatically, so it depends on a custom tool, but it is not particularly difficult.
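The lookup side of that idea can be sketched in a few lines of C++. Everything here is an assumption made for illustration: the `Lexeme` struct, the `insert_semicolons` function and the token names are hypothetical, and the pair set is written by hand, whereas in the approach described above it would be generated from the grammar's FIRST and FOLLOW sets by a separate tool.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scanner output: a token name plus a flag saying whether a
// line break separated it from the previous token.
struct Lexeme {
    std::string kind;
    bool newline_before;
};

// Pairs (previous token, next token) that get a semicolon between them
// when a newline separates the two tokens.
using PairSet = std::set<std::pair<std::string, std::string>>;

std::vector<std::string> insert_semicolons(const std::vector<Lexeme>& in,
                                           const PairSet& needs_semicolon) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        // Only consecutive tokens split by a newline are looked up.
        if (i > 0 && in[i].newline_before &&
            needs_semicolon.count(std::make_pair(in[i - 1].kind, in[i].kind)) > 0) {
            out.push_back("SEMI_COLON");
        }
        out.push_back(in[i].kind);
    }
    return out;
}
```

The decision uses a window of exactly two tokens, which is why it works for ecmascript and why it can misfire when the same pair also occurs in contexts where a semicolon is not appropriate.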

Haskell does not seem to be so accommodating. I see in the Haskell report, section 9.3, at the end of that section:

The parse-error rule is hard to implement in its full generality, because doing so involves fixities. For example, the expression
do a == b == c
has a single unambiguous (albeit probably type-incorrect) parse, namely
(do { a == b }) == c
because (==) is non-associative. Programmers are therefore advised to avoid writing code that requires the parser to insert a closing brace in such situations.

That's not very promising, but it also suggests that implementations are not expected to be perfect, and that programmers are kindly requested to not expect a perfect parser implementation :)

I think translating the IndentWrapper shim in the linked answer into C++ would not be difficult even for someone not too familiar with Python, so I haven't bothered doing it here. If that assumption is incorrect, let me know.

Interesting approach - I know Python, I'll have a go at this in the morning. — zcourts, Dec 31 '15 at 04:56
I read the linked Python answer. The layout tracker I have already does that, and quite a bit more to handle special cases. In my case I create a stack of `Block`s and record the token that created each one, its position, whether it's inside a `do` (where layout rules change), and whether it's virtual (a virtual block doesn't insert braces, but expressions created in it do, and when the virtual block ends it needs to close all the blocks nested within it, e.g. inside lists, list comprehensions, or tuples, which can have arbitrary expressions). The dictionary idea sounds promising though; I'll have a go after work today and see. — zcourts, Dec 31 '15 at 13:47
@zcourts: Yeah, I took a closer look at the Haskell language spec and I'll rewrite my answer in light of that. But not right now :) The wording in the Haskell2010 report is more promising since it specifies roughly the same rule as ecmascript uses, i.e. automatic insertion only if the next token cannot extend a valid prefix, which as I read it means that you don't need to run the fixity algorithm to decide, making it actually potentially solvable. I don't know if you can reduce that condition to a small window, as you can in ecmascript, but it may well be "close enough". — rici, Dec 31 '15 at 20:54
@zcourts: Also, I'm pretty sure that the reduce/reduce conflict is not at all related to layout-aware parsing, so that was a bit of a red herring. Although it's probably my fault; I reread your question and I'm not sure whether or not you were assuming that; it seems that you are just looking for a similar way to insert a semicolon to avoid the R/R conflict. Anyway, I'll take another look later if I get a chance. — rici, Dec 31 '15 at 21:28
I just had a go at the dictionary approach, ended up with 89 permutations. The problem with the tokens in the dictionary is that many of those permutations also occur in a large number of other situations where ; isn't appropriate. Why do you suspect this was a red herring? I'll update the post with a pastebin to the debug info with the conflicts. — zcourts, Jan 02 '16 at 18:00
@zcourts: Because a shift/reduce conflict is (somewhat) expected. ("The grammar is ambiguous regarding the extent of lambda abstractions, let expressions, and conditionals. The ambiguity is resolved by the meta-rule that each of these constructs extends as far to the right as possible.") I still haven't read the entire report, the holidays got in the way. But that quote indicates an expected s/r conflict. My hunch is that the r/r conflict comes from something parsed as one non-terminal at the beginning of an expr and as a different one at the end. I know that's hand-wavy, and ... — rici, Jan 02 '16 at 19:14
... I might have some time to get more involved in it this weekend. — rici, Jan 02 '16 at 19:14
