
I have implemented parser combinators that can parse grammars that may contain ambiguity. An error is given when the grammar is ambiguous, but going in the other direction is proving more difficult. The question is how to pretty-print an abstract syntax tree for a potentially ambiguous grammar with a minimal number of parentheses. Using operator precedence levels helps but is not a panacea: inside the same precedence level, the problem persists.

The exact operators are not known until runtime and can change during execution when the user introduces a new operator. I support prefix, postfix, and infix (left-, right-, and non-associative) operators. Infix-left and postfix operators can be mixed at the same precedence level; the same applies to infix-right and prefix operators. Operators can also embed full expressions, so if-then-else and if-then could both be implemented as prefix operators (although that might not be a smart move).

Here is an example using the mentioned if-then-else and if-then operators, which are assumed here to be at the same precedence level. Obviously, the expression if a then if b then c else d is ambiguous, as it can be interpreted as if a then (if b then c) else d or as if a then (if b then c else d). During pretty-printing, the algorithm should know to use parentheses even though both operators are at the same precedence level and have compatible associativity (to the right).

A cautionary example: add another prefix operator, say inc, with the same precedence as if-then-else and if-then. Now assume an arbitrary set P ⊂ H × O, where H is the set of operator holes and O is the set of operators. The set is meant to be a relation that tells when parentheses need to be added. Examine the expressions if a then inc b else c and if a then (inc if b then c) else d. The first requires (if-then-else.2, inc) not to be in P, and the second requires the opposite. This contradicts the assumption that the problem can be solved by some kind of relation or order. One could instead let (inc.1, if-then) be in P, turning the latter expression into if a then inc (if b then c) else d, but then inc if a then b becomes inc (if a then b), which has too many parentheses.

To my knowledge, the grammar is context-free. I'm a little shaky on the definition though.

The parser is loosely based upon a paper (http://www.cs.uu.nl/research/techreps/repo/CS-2008/2008-044.pdf). I am using Haskell.

Update: As demonstrated by Maya, the problem is unsolvable in general. I would be willing to accept an algorithm that may fail. If even that is not enough to make things practical, a good heuristic will do.

  • I guess what you want is to print the AST for an unambiguous sentence recognised by an unambiguous grammar with the minimum number of parentheses without creating ambiguity. I don't know if that formulation is any easier to understand, but it took me a while to figure out what you wanted. – rici May 17 '18 at 16:11
  • Question: Is your parser effectively an operator-precedence parser? Or is it a more general CFG (or subset) which can be interpreted as having precedence? – rici May 17 '18 at 16:12
  • The parser is based upon a paper at http://www.cs.uu.nl/research/techreps/repo/CS-2008/2008-044.pdf – Topi Karvonen May 17 '18 at 16:20
  • It can parse pretty much any grammar, but like all parser combinator libraries that I know of, it cannot handle recursion without consuming some input first. – Topi Karvonen May 17 '18 at 16:22
  • However the operator precedence parser is definitely the biggest and hardest part of the parser as well as the biggest source of ambiguity. It took a while to get right. – Topi Karvonen May 17 '18 at 16:25
  • I don't know what CFG means. – Topi Karvonen May 17 '18 at 16:25
  • Context Free Grammar – rici May 17 '18 at 16:26
  • As to what I mean, here is an example using the mentioned if-then-else and if-then operators, which are here assumed to be at the same precedence level. Obviously, the expression (if a then if b then c else d) is ambiguous, as it can be interpreted as (if a then (if b then c) else d) or (if a then (if b then c else d)). During pretty-printing, the algorithm should know to use parentheses even though both operators are at the same precedence level and have compatible associativity (to the right). – Topi Karvonen May 17 '18 at 16:38
  • OK, that makes it clearer. Why don't you edit your question with that example (and maybe the reference to Doaitse's paper), rather than hiding it in a comment thread that no-one is going to read :-) – rici May 17 '18 at 16:39
  • To my understanding the parser can handle grammars other than CFG as it can parse things like [(]) where the parentheses and brackets must match. (Except my lexer already handles parens and brackets and can't handle mismatched parens, so...) – Topi Karvonen May 17 '18 at 17:08
  • The more powerful your grammar is, the harder it is to solve the "shortest unambiguous representation" problem in general. If context-sensitivity is available, it may well be impossible to solve in polynomial time (but that's just a guess). – rici May 17 '18 at 18:09
  • Any conservative heuristic then? Also while context sensitivity is possible with the parser combinators I have implemented, if memory serves I have not actually used the feature. – Topi Karvonen May 17 '18 at 18:21
  • OK; I'll have a think. Or maybe someone else will jump in :) – rici May 17 '18 at 18:21

2 Answers


In full generality, this is impossible. Consider the operators A_B, _C_, A_C_B. The expression A_C_B 1 2 (i.e. A 1 C 2 B) is impossible to parenthesize such that it cannot be parsed as A (1 C 2) B.
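The counterexample can be checked mechanically. The following is a minimal Python sketch, not part of the original answer: the tuple encoding of mixfix operators and the names `HOLE`, `render`, `ternary`, and `nested` are assumptions made for illustration. It shows that the two trees print to the same token string, and that parenthesizing the only available subtrees of the ternary tree (its leaves) does not separate them.

```python
HOLE = None  # marks an argument position inside a mixfix operator

# Hypothetical encoding: an operator is a tuple of keywords and holes.
A_B = ("A", HOLE, "B")
C = (HOLE, "C", HOLE)
A_C_B = ("A", HOLE, "C", HOLE, "B")

def render(tree, wrap=lambda t: False):
    """Print a tree; `wrap` decides which arguments get parentheses."""
    if isinstance(tree, str):          # a leaf such as "1"
        return tree
    parts, args = tree
    remaining = iter(args)
    out = []
    for part in parts:
        if part is HOLE:
            arg = next(remaining)
            text = render(arg, wrap)
            out.append(f"({text})" if wrap(arg) else text)
        else:
            out.append(part)           # a keyword of the operator
    return " ".join(out)

ternary = (A_C_B, ("1", "2"))          # A_C_B 1 2
nested = (A_B, ((C, ("1", "2")),))     # A_B (_C_ 1 2)

print(render(ternary))                 # A 1 C 2 B
print(render(nested))                  # A 1 C 2 B
# Wrapping every argument of the ternary tree still collides with the
# nested tree printed with parentheses around its leaves only.
print(render(ternary, wrap=lambda t: True))
print(render(nested, wrap=lambda t: isinstance(t, str)))
```

Both pairs of lines print identical text, so no choice of parentheses inside the ternary form rules out the nested reading.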

Maya

You could construct a partial order relation of sorts between all operators based on their actual associativity and precedence as defined.

Because the precedence of operators depends on which position in the rule the recursion occurs (leftmost, in-the-middle, or rightmost) the relation should include which position of the parent node the precedence holds for.

Say the relation has type rel[Operator parent, int pos, Operator child].

Assuming you can generate this relation from the priority and associativity declarations as they are applied at run-time, then using this relation adding brackets during pretty printing is easy. If the tuple [parent, pos, child] is in the relation then you print brackets, otherwise not (or vice versa if the relation is inverted).
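As a rough illustration of that lookup, here is a minimal Python sketch. It assumes binary infix operators only, and the relation is written by hand for a tightly binding `*` and a looser `+`, both left-associative; the names `BinOp`, `NEEDS_BRACKETS`, and `show` are invented for this example. Positions index the symbols of a rule `E op E`, so the operands sit at positions 0 and 2.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"

Expr = Union[BinOp, str]

# Hand-written relation (parent, position, child): "brackets required".
NEEDS_BRACKETS = {
    ("*", 0, "+"), ("*", 2, "+"),   # + binds looser than *
    ("+", 2, "+"), ("*", 2, "*"),   # left associativity
}

def show(e, relation=NEEDS_BRACKETS):
    """Print an expression, bracketing a child only when the relation says so."""
    if isinstance(e, str):
        return e

    def operand(child, pos):
        text = show(child, relation)
        if isinstance(child, BinOp) and (e.op, pos, child.op) in relation:
            return f"({text})"
        return text

    return f"{operand(e.left, 0)} {e.op} {operand(e.right, 2)}"

print(show(BinOp("*", BinOp("+", "a", "b"), "c")))   # (a + b) * c
print(show(BinOp("+", BinOp("*", "a", "b"), "c")))   # a * b + c
print(show(BinOp("+", "a", BinOp("+", "b", "c"))))   # a + (b + c)
```

The decision here looks only at the parent operator, the hole, and the immediate child, which is exactly the kind of local relation the question's inc example argues is not always sufficient.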

How to get this relation? There is example code here for Rascal's grammar formalism which generates it from relative priorities between operators: https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/rascal/grammar/definition/Priorities.rsc

It starts from rules such as this:

E = left E "*" E  
  > left E "+" E
  ;

and produces something like:

{<"*", 0, "+">, <"*", 2, "+"> // priority between * and + 
,<"+", 2, "+">, <"*", 2, "*"> // left associativity of * and +
}

This table explains which nestings at which positions need extra brackets, so if a + is nested under a * at the 0th position, you'd need to print brackets.

Suppose you have a precedence table instead which says:

0 * left
1 + left

or something in that vein, then a similar relation can be constructed. We have to generate a tuple for every pair of levels i, j in the table where i < j. Of course, you'd have to look up the rule for every operator to find out what the right positions are.
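One possible way to build the relation from such a table, sketched in Python. This is not the Rascal implementation; it assumes every operator is binary infix (operands at positions 0 and 2), that a lower level number binds tighter, and that operators sharing a level share an associativity. The function name `relation_from_table` and the tuple layout are hypothetical.

```python
def relation_from_table(table):
    """table: iterable of (level, operator, associativity) entries,
    where associativity is "left", "right", or "non"."""
    rel = set()
    for lvl_parent, parent, assoc in table:
        for lvl_child, child, _ in table:
            if lvl_parent < lvl_child:
                # A looser operator under a tighter one needs brackets
                # on both sides.
                rel.add((parent, 0, child))
                rel.add((parent, 2, child))
            elif lvl_parent == lvl_child:
                if assoc != "right":
                    rel.add((parent, 2, child))   # bracket the right operand
                if assoc != "left":
                    rel.add((parent, 0, child))   # bracket the left operand
    return rel

table = [(0, "*", "left"), (1, "+", "left")]
print(sorted(relation_from_table(table)))
# [('*', 0, '+'), ('*', 2, '*'), ('*', 2, '+'), ('+', 2, '+')]
```

For the two-level table this yields the same four tuples as the relation produced from the Rascal rules. Operators with other shapes would need their hole positions looked up from their rules instead of the fixed 0 and 2.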

For these tables and relative priorities as in Rascal, it is important to transitively close the relation; however, some tuples must not be added if you don't want to generate too many brackets while pretty-printing.

Namely, if the parent rule is right-most recursive and the child rule is left-most recursive, then a bracket is necessary. Also vice versa. But otherwise not. Consider this example:

E = "if" E "then" E
  > E "+" E
  ;

In this case we do want brackets in the right-most hole, but not in the guarded hole between the "if" and the "then". Similar examples for indexing rules such as E "[" E "]", etc.

To make sure this works, you can first compute which rules are right-most recursive and which are left-most recursive, and then filter out the tuples of the transitive closure that are not ambiguous because they are not in ambiguous positions.

So for the above example we'd generate this:

{<"if", 3, "+">, // and not <"if", 1, "+"> because the 1 position is not left-most recursive, even though the "+" is right-most recursive.
}
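The filtering step might look like the following Python sketch. The rule encoding (a tuple of symbols with "E" for a recursive hole) and the names `RULES`, `candidates`, and `is_ambiguous_position` are assumptions for illustration, and the "index" rule stands in for the `E "[" E "]"` example mentioned above.

```python
# Hypothetical rule table: operator name -> symbols of its production.
RULES = {
    "if": ("if", "E", "then", "E"),
    "+": ("E", "+", "E"),
    "index": ("E", "[", "E", "]"),
}

def candidates(parent, child, rules):
    """All (parent, pos, child) tuples, one per recursive hole of parent."""
    return {(parent, pos, child)
            for pos, sym in enumerate(rules[parent]) if sym == "E"}

def is_ambiguous_position(parent, pos, child, rules):
    """Keep a tuple only when the hole is on an open edge of the parent
    and the child is recursive on the facing edge."""
    p, c = rules[parent], rules[child]
    right_edge = pos == len(p) - 1 and c[0] == "E"   # parent right-most, child left-most
    left_edge = pos == 0 and c[-1] == "E"            # parent left-most, child right-most
    return right_edge or left_edge

def filtered(parent, child, rules=RULES):
    return {t for t in candidates(parent, child, rules)
            if is_ambiguous_position(*t, rules)}

print(filtered("if", "+"))      # {('if', 3, '+')}
print(filtered("index", "+"))   # {('index', 0, '+')}
```

The hole between "if" and "then" and the hole between the square brackets are dropped because keywords guard them on both sides.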

Papers on this topic, but they use the same relation for parsing and not for unparsing:

Jurgen Vinju
  • I'd use one relation for the precedence, one relation for (mutual) left assoc and one for right assoc. – Jurgen Vinju May 20 '18 at 08:16
  • If the precedence and assoc was enough to resolve ambiguity while parsing then it is now enough to introduce brackets to avoid the same ambiguity. This can also be used to print trees which are not constructed by parsing but for example by code generation or rewriting, as long as you know for every operator where in the partial order it belongs. – Jurgen Vinju May 20 '18 at 08:19
  • I'm confused how this works when embedding expressions in operators. Should there be a node for each operator hole? How would one generate such partial order from the operator table? – Topi Karvonen May 20 '18 at 12:38
  • It is not necessary for precedence relations to be transitive. If you limit the algorithm to transitive relations you will find that not all possible grammars can be handled. (Practical real-world examples exist.) – rici May 20 '18 at 16:45
  • Yes, some exceptions exist which can be modeled as extra tuples. – Jurgen Vinju May 21 '18 at 14:18
  • Each hole which is recursive to the expression type needs to be scrutinized; the others don't. The AST needs to be able to distinguish different binary and unary operators, and the operator table needs to be prepared indeed. Only left-recursive and right-recursive holes have to be considered for adding brackets, since otherwise there is already some kind of hedge. E.g. in 1 + a[2+3] the first + needs brackets because it's in a left-recursive position, while the second plus is guarded by the square brackets. – Jurgen Vinju May 21 '18 at 14:23
  • When you say left-recursive or right-recursive holes, I assume you mean the first argument of an infix-left or postfix operator, or the last argument of an infix-right or prefix operator. Does the case (if a then (if b then c) else d) not contradict this? The outer if needs parens inside its second argument, but its recursive argument by my definition is the last (third). Or did you mean recursive to the expression type, as in wherever the Expression type occurs in the AST? That would be pretty much everywhere. – Topi Karvonen May 21 '18 at 18:06
  • I'm also still waiting to understand how to generate the partial order you mentioned. – Topi Karvonen May 21 '18 at 18:07
  • I just realized something. Add another prefix operator say **inc** of the same precedence as if-then-else and if-then. Now assume an arbitrary set P : H x O where H is the set of operator holes and O is the set of operators. Examine the expressions (if a then inc b else d) and (if a then (inc if b then c) else d) the first requires (if-then-else.2, inc) to not be in P and the second requires the opposite. This contradicts the assumption the problem can be solved by an order (pre, partial or otherwise) – Topi Karvonen May 21 '18 at 18:54
  • Your answer is incorrect. See updated question. Please update or delete your answer. – Topi Karvonen May 23 '18 at 13:43
  • Did you not read my updated question? I explicitly provided a counter example that proves a relation of operator holes and operators is not sufficient to decide when parentheses should be added. Your answer likely depends on assumptions not present in ambiguous grammar. – Topi Karvonen May 23 '18 at 18:40
  • I read it but I think this solution would not put a bracket around the entire inc expression but rather around the nested if-then under the inc. that would be consistent right? Perhaps I'm missing something. Please check. – Jurgen Vinju May 24 '18 at 19:51
  • There are, by the way, famous cases where this scheme cannot work, for example in the ML language, where some prefix unary operators have lower precedence than other binary and ternary operators. That makes the parent/child relation not powerful enough, because the ambiguous child may be nested too deeply for the pretty printer or parser to see easily. – Jurgen Vinju May 24 '18 at 19:59
  • I'm finally catching up with you. Better late than never. Great example and I think you are right that a simple order applied to the parent child relation would not solve this one. Maybe we could make the pretty printer a little smarter, carrying more information down the stack as it's progressing down the tree. The fact that the grammar is context free says nothing about its unambiguous pretty printing. – Jurgen Vinju May 24 '18 at 22:09
  • The if-then-else is particular because, as opposed to precedence ambiguity, where two operators switch order in the tree, here we also have terminals moving from one rule to the other (the "if" occurs in more than one rule). So another way of looking at this is to see how your parser resolved this particular ambiguity and use that same info to solve it again here. It's not just precedence; there is also eagerness involved. – Jurgen Vinju May 24 '18 at 22:14
  • The short if-then statement must never be printed without brackets directly before an "else". That's not a local tree property. It means the rightmost recursive child of the second hole in an if then else, however deeply nested, if it is an if-then must have brackets. Otherwise not. – Jurgen Vinju May 24 '18 at 22:21
  • I guess adding an AST node for brackets as a transformation before pretty-printing is best. This tree transformation can take on the task of carrying info down the tree, and it keeps the pretty-printing-to-string code clean, free from this complexity. – Jurgen Vinju May 24 '18 at 22:26
  • The parser rejects all ambiguous grammar. I would rather not introduce parenthesis wrapper in the AST as this would even more tightly couple the pretty printer and the AST and I'm unconvinced it would measurably reduce the complexity of the pretty printer. – Topi Karvonen May 25 '18 at 07:43
  • I have been looking at how to detect potential sources of ambiguity from the pretty printer. No luck so far, though. As you pointed out, one can resolve the `if a then (if b then c) else d` case by noting that if-then is a prefix of if-then-else. The `if a then (if b then c else d)` case is giving me gray hair. Remember the actual operators are not known until runtime. – Topi Karvonen May 25 '18 at 07:43
  • Yep. It's a bad case because it's more than simple operator precedence. Is it conceivable to rewrite the grammar, or should your system just work for this shape of any grammar? Or even change the language? – Jurgen Vinju May 26 '18 at 21:00
  • The actual grammar has not actually been formalized yet, so that isn't the issue. What constrains the grammar is the architecture of the parser which is designed to reject all ambiguous grammar. Also I would prefer it stays that way. Allowing ambiguous grammar with some greedy rule tacked on is just asking for bugs when the programmer misunderstands what he has written. Implementing a greedy rule is technically possible, but I don't see how I would integrate it with the operator precedence parsing function, that does all the heavy lifting. – Topi Karvonen May 27 '18 at 19:39
  • Also, again, the actual operators are not known until runtime, so the users would have to specify all greedy rules needed, which is impractical. – Topi Karvonen May 27 '18 at 19:49
  • I think I just pressed one of my buttons there. I guess it would be possible to limit the occurrence of operator lex items to one operator, but that just seems too stifling. – Topi Karvonen May 27 '18 at 20:15
  • I find a specific remark contradictory until we clear it up. Perhaps solving this leads to a solution. If the parser does not accept ambiguity as you said how can a pretty printed string for the same language be ambiguous? Or is this a term definition issue? – Jurgen Vinju May 29 '18 at 13:35
  • Using the if-then-else and if-then operators as an example: the parser accepts `if a then (if b then c else d)` and `if a then (if b then c) else d`, but rejects `if a then if b then c else d` due to ambiguity. – Topi Karvonen May 31 '18 at 13:38
  • I perhaps use the term ambiguous grammar quite loosely. I either mean grammar that contains both ambiguous and unambiguous expressions or instance of grammar (an expression) that is ambiguous and thus is rejected by the parser. – Topi Karvonen May 31 '18 at 13:52
  • The difficulty arises from the need to both minimize the number of parentheses and still have enough we stay on the unambiguous side of the grammar. – Topi Karvonen May 31 '18 at 13:58
  • Ooo! that's a very cool design decision for a parser. Well, why don't you just include an AST node for the brackets then? Then they can simply be unparsed again. Or do AST nodes also come from other computations next to the parser? – Jurgen Vinju Jun 02 '18 at 21:59
  • the brackets won't be minimal, because they will be what the user typed (or perhaps this is another definition of "minimal"). Still, if desugarings or other rewrite rules produce AST nodes, we are stuck with the same old issue. – Jurgen Vinju Jun 02 '18 at 22:00
  • AST nodes can come from sources other than the parser, such as metaprogramming (a feature that is probably quite far off) and type inference. – Topi Karvonen Jun 03 '18 at 08:51
  • So, for now, additional AST nodes for brackets will keep your pretty printing unambiguous, since your parser will never produce ambiguous trees. For the future, when metaprograms start producing trees with not enough brackets, I think you still have a really good open question! Or you might loosen the minimality constraint and accept a few extra brackets when the precedence relation can't remove the unnecessary ones. – Jurgen Vinju Jun 05 '18 at 08:31
  • I need the pretty printer now, as I'm working on type inference and need to debug it. I would be willing to accept a good conservative heuristic, but I do not think the precedence relation is one. It would be hard to generate the precedence relation, maybe as hard as solving the problem in the first place. Introducing new operators could affect even non-connected operators. – Topi Karvonen Jun 05 '18 at 16:11
  • I don't see how a bracket node could do harm. Maybe I'm wrong. A bracket node is just a node labeled "bracket", producing the same algebraic type as the single argument it wraps. – Jurgen Vinju Jun 06 '18 at 16:37
  • Make your parser produce these nodes instead of skipping over brackets, and you can start debugging the inferencer today. – Jurgen Vinju Jun 06 '18 at 16:38
  • Btw there is also a completely brainless solution which takes a lot more memory... This is to include all original whitespace and comments in the ast. Technically it's not an ast anymore then, more like a parse tree. We've used this solution for refactoring tools where it's important to retain the original layout of the source code as much as possible. It works great but the memory penalty is big. – Jurgen Vinju Jun 06 '18 at 16:42
  • The problem is the ast nodes produced by the type inferer won't have these so called bracket nodes. – Topi Karvonen Jun 07 '18 at 10:38
  • Ok. Too bad. I didn't see that one coming – Jurgen Vinju Jun 08 '18 at 14:30
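
The rule stated in the comments (a short if-then must never be printed without brackets directly before an "else", however deeply it is nested on the right-most path of the second hole) can be sketched as a printer that carries a flag down the tree. This is a minimal Python illustration under strong assumptions: the operators are hard-coded rather than discovered at runtime, the only other operator is a generic prefix operator such as inc, and the class names and `show` are invented for the example. It covers only this one rule; the `if a then (if b then c else d)` case raised in the thread is not handled, so a parser without a greedy rule would still reject some of its output.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class IfThen:
    cond: "Expr"
    then: "Expr"

@dataclass(frozen=True)
class IfThenElse:
    cond: "Expr"
    then: "Expr"
    orelse: "Expr"

@dataclass(frozen=True)
class Prefix:
    op: str
    arg: "Expr"

Expr = Union[IfThen, IfThenElse, Prefix, str]

def show(e, before_else=False):
    """before_else: the text printed for `e` ends directly before an 'else'."""
    if isinstance(e, str):
        return e
    if isinstance(e, Prefix):
        # The argument is the right-most child, so it inherits the flag.
        return f"{e.op} {show(e.arg, before_else)}"
    if isinstance(e, IfThenElse):
        # The condition is guarded by keywords; the then-branch is followed
        # by our own 'else'; the else-branch is right-most and inherits.
        return (f"if {show(e.cond)} then {show(e.then, True)} "
                f"else {show(e.orelse, before_else)}")
    # IfThen: bracket it whenever an 'else' would follow immediately.
    text = f"if {show(e.cond)} then {show(e.then)}"
    return f"({text})" if before_else else text

print(show(IfThenElse("a", IfThen("b", "c"), "d")))
# if a then (if b then c) else d
print(show(IfThenElse("a", Prefix("inc", IfThen("b", "c")), "d")))
# if a then inc (if b then c) else d
print(show(Prefix("inc", IfThen("a", "b"))))
# inc if a then b
```

Because the flag travels through inc, the nested if-then is bracketed only when an "else" actually follows it, which a purely local parent/child relation cannot express.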