In Rascal, why does layout at the position of an optional part of a production cause ambiguity? For example, "{ }" is ambiguous as Start1 but parses fine as Start2 in the following grammar, although I would have expected the two to be equivalent.

layout Layout                               = " "?;
start syntax Start1                         = "{" "c"? "}";
start syntax Start2                         = "{" "c" "}"
                                            | "{" "}";

In addition, I would like to know whether there is a way, other than Start1, to write Start2 without duplication that does not cause the same ambiguity.

Obviously there is little duplication in this example, and Start2 is a good option here. However, I am working with a grammar in which many productions contain three or four optional parts. With four optional parts, the notation used for Start2 requires writing out the non-optional parts of the production 2^4 = 16 times, which is really troublesome in my opinion.

1 Answer

Before a parser is generated, your grammar is first extended to something similar to this:

layout Layout                         = " "?;
syntax " "?                           =  | " ";
syntax Start1                         = "{" Layout "c"? Layout "}";
syntax "c"?                           =  | "c";
lexical " "                           = [\ ];
lexical "c"                           = [c];
lexical "{"                           = [{];
lexical "}"                           = [}];
syntax Start2                         = "{" Layout "c" Layout "}"
                                      | "{" Layout "}";
syntax start[Start1] = Layout Start1 Layout;
syntax start[Start2] = Layout Start2 Layout;

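So for an input like { } (with a space between the curly braces), the space can be derived by the first instance of Layout in the right-hand side of the Start1 rule or by the second instance, with the other instance deriving the empty string. Since the parser produces all derivation trees, two in this case, the parse is ambiguous. Start2 does not have this problem: its "{" "}" alternative contains only one instance of Layout, and the alternative containing "c" does not match the input.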

Typically the ambiguity is solved by introducing greediness using a follow restriction like so:

layout Layout = " "? !>> " ";

or (equivalently) like so:

layout Layout = " "? !>> [\ ];

The restriction acts as a constraint on the Layout rule: it will not derive anything (not even the empty string) if a space follows it. This leaves only the first derivation valid, the one where the space goes inside the first Layout instance of Start1. The second Layout instance then derives the empty string and is followed by }, which satisfies the constraint, so the parse is unambiguous.
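
As a rough way to check this, the sketch below puts the restricted layout and Start1 in one module and tests whether a parse tree contains an ambiguity node. The module name LayoutDemo and the helper isAmbiguous are hypothetical names chosen for illustration, and the sketch assumes a Rascal version in which parse from the ParseTree library accepts the allowAmbiguity keyword parameter.

```rascal
module LayoutDemo

import ParseTree;

// Greedy layout, as suggested above: Layout may not be followed by a space.
layout Layout = " "? !>> " ";

start syntax Start1 = "{" "c"? "}";

// Hypothetical helper: true if the parse tree of the input contains an
// ambiguity cluster. allowAmbiguity=true asks parse to return the tree
// instead of throwing on ambiguity (assumed to be available in the
// Rascal version in use).
bool isAmbiguous(str input) {
  Tree t = parse(#start[Start1], input, allowAmbiguity=true);
  return /amb(_) := t;
}
```

If the restriction behaves as described, isAmbiguous("{ }") should report no ambiguity, while removing the !>> " " part from the Layout definition should bring the ambiguity back.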

Jurgen Vinju
  • Thanks for the detailed answer. It made much more sense to me once I saw how the grammar is extended, and the solution is easy to understand. It still feels a bit odd to me that this is necessary, though. Wouldn't it be useful to have a variant of the question mark (and asterisk) available for when you don't care about the location of layout? I can imagine this would be used quite a lot. – Olav Trauschke Jul 06 '16 at 14:58
  • Yes, we thought about that too, but an eager ? or * can easily lead to parse errors that are very unexpected and hard to debug. In that respect it's easier to fix an ambiguity than a parse error. Declarative disambiguation can also introduce parse errors, but at least it's explicitly visible. Nevertheless, we are thinking of introducing eager semantics on the lexical level for regular token sub-languages while keeping the context-free part general. Future work! – Jurgen Vinju Jul 06 '16 at 16:27