
I am trying to write a PEGjs rule to convert

Return _a_b_c_.

to

Return <>a_b_c</>.

My grammar is

root = atoms:atom+
{ return atoms.join(''); }

atom = variable
     / normalText

variable = "_" first:variableSegment rest:$("_" variableSegment)* "_"
{ return '<>' + first + rest + '</>'; }

variableSegment = $[^\n_ ]+

normalText = $[^\n]

This works for

Return _a_b_c_ .

and

Return _a_b_c_

but something is going wrong with the

Return _a_b_c_.

example.

I can't quite understand why this is breaking, and would love an explanation of why it's behaving as it does. (I don't even need a solution to the problem, necessarily; the biggest issue is that my mental model of PEGjs grammars is deficient.)

Domenic
  • `_.` matches the `"_" variableSegment` part, but then the trailing `"_"` is missing, that's why the `normalText` rule is used. (`Return _a_b_c_.` is _the same_ as `Return _a_b_c_d`) – t.niese Sep 14 '14 at 15:45
  • Why doesn't it see that the rule isn't going to match when interpreted that way, and instead interpret the last _ as the trailing _? – Domenic Sep 14 '14 at 15:51
  • When it fails upon seeing no trailing `_` after the `.`, it backtracks by assuming that the *leading* `_a` are "normal". I'm not sure I know why. – Pointy Sep 14 '14 at 16:02
  • Oh - let me edit my answer again if you don't want that trailing `_` – Pointy Sep 14 '14 at 16:39

2 Answers


Rearranging the grammar slightly makes it work:

root = atoms:atom+
{ return atoms.join(''); }

atom = variable
     / normalText

variable = "_" first:$(variableSegment "_") rest:$(variableSegment "_")*
{ return '<>' + first + rest + '</>'; }

variableSegment = seg:$[^\n_ ]+

normalText = normal:$[^\n]

I'm not sure I understand why, exactly. In this one, the parser gets to the "." and matches it as a "variableSegment", but then finds no "_" after it. Only that last iteration of the greedy "*" fails, so the parser keeps the iterations that already matched, decides it's got a "variable", and then re-parses the "." as "normal". (Note that this picks up the trailing _, which if not desired can be snipped off by a hack in the action, or something like that; see below.)

In the original version, after failing because of the missing trailing underscore, the very next step the parser takes is back to the leading underscore, opting for the "normal" interpretation.

I added some action code with console.log() calls to trace the parser behavior.

edit — I think the deal is this. In your original version, the parse is failing on a rule that's of the form

expr1 expr2 expr3 ... exprN

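The first sub-expression is the literal "_". The next is for the first variable segment. The third is for the sequence of variable segments, each preceded by "_", and the last is the trailing "_". While working through that rule on the problematic input, the greedy third expression also swallows the final "_.", because "." is a legal variableSegment character, so the last expression finds no "_" left and fails. The others have all succeeded, however, and the parser won't go back into them to try a shorter match, so the only place to start over is at the alternative point in the "atom" rule.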
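
To make that concrete, here is a rough hand-written sketch of how the original "variable" rule behaves. It is plain JavaScript, not what PEG.js actually generates, and the helper names (isSegChar, segment, originalVariable) are made up for illustration. Each function returns the position after the match, or -1 on failure.

```javascript
// Hand-written sketch of the ORIGINAL rule (hypothetical helpers, not PEG.js output):
//   variable = "_" variableSegment ("_" variableSegment)* "_"
const isSegChar = (ch) => ch !== undefined && ch !== "\n" && ch !== "_" && ch !== " ";

// variableSegment = $[^\n_ ]+  (one or more characters that are not newline, "_" or space)
function segment(input, pos) {
  let p = pos;
  while (isSegChar(input[p])) p++;
  return p > pos ? p : -1;
}

function originalVariable(input, pos) {
  if (input[pos] !== "_") return -1;      // leading "_"
  let p = segment(input, pos + 1);        // first variableSegment
  if (p === -1) return -1;
  // Greedy *: keep taking ("_" variableSegment) while it matches.
  // Once an iteration has succeeded it is never handed back.
  for (;;) {
    if (input[p] !== "_") break;
    const next = segment(input, p + 1);
    if (next === -1) break;
    p = next;
  }
  // Trailing "_": if it is missing, the whole sequence fails.
  return input[p] === "_" ? p + 1 : -1;
}

console.log(originalVariable("_a_b_c_.", 0));  // -1: the loop consumed "_." as well
console.log(originalVariable("_a_b_c_ .", 0)); // 7: the space stops the loop before the last "_"
console.log(originalVariable("_a_b_c_", 0));   // 7: end of input stops the loop
```

On the failing input the loop has already consumed the final "_.", so nothing is left for the trailing "_" check, and there is no step that retries the loop with one iteration fewer.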

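In the revised version, the "_" after each segment is checked inside each iteration of the greedy *, so an iteration that fails simply isn't counted and the * stops one step earlier. It then has a successful match of the third expression, and since that is the last one, the rule succeeds.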

Thus another revision, closer to the original, will also work:

root = atoms:atom+
{ return atoms.join(''); }

atom = variable
     / normalText

variable = "_" first:variableSegment rest:$("_" variableSegment & "_")* "_"
{ return '<>' + first + rest + '</>'; }

variableSegment = $[^\n_ ]+

normalText = $[^\n]

Now that greedy * group will backtrack when it fails in peeking forward at an _.
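
As a rough illustration of that lookahead version, here is a small hand-written JavaScript sketch (again not PEG.js output; segmentEnd, variableWithLookahead and convert are hypothetical names). One simplification to be aware of: convert passes newlines through, whereas the grammar's normalText would reject them.

```javascript
// Sketch of: variable = "_" variableSegment ("_" variableSegment &"_")* "_"
function segmentEnd(input, pos) {
  let p = pos;
  while (p < input.length && !"\n_ ".includes(input[p])) p++;
  return p > pos ? p : -1; // -1 means no segment here
}

function variableWithLookahead(input, pos) {
  if (input[pos] !== "_") return -1;
  let p = segmentEnd(input, pos + 1);
  if (p === -1) return -1;
  for (;;) {
    if (input[p] !== "_") break;
    const next = segmentEnd(input, p + 1);
    // &"_" only peeks; if no "_" follows, this iteration is discarded
    // and the loop ends at the previous position.
    if (next === -1 || input[next] !== "_") break;
    p = next;
  }
  return input[p] === "_" ? p + 1 : -1; // trailing "_"
}

// root = atom+ ; atom = variable / normalText
function convert(input) {
  let out = "";
  let pos = 0;
  while (pos < input.length) {
    const end = variableWithLookahead(input, pos);
    if (end !== -1) {
      out += "<>" + input.slice(pos + 1, end - 1) + "</>";
      pos = end;
    } else {
      out += input[pos]; // fall back to normalText, one character
      pos++;
    }
  }
  return out;
}

console.log(convert("Return _a_b_c_."));  // Return <>a_b_c</>.
console.log(convert("Return _a_b_c_ .")); // Return <>a_b_c</> .
```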

Pointy
  • That's because the rule is not _ambiguous_ anymore. Either it will find another `variableSegment "_"` or the variable is finished (it is the last part in that rule). PEG.js does not go a step backwards inside of the `variable`. It tries to match it step by step, either with a fulfillment (in your case) or a failure in the original code. (Sorry for the bad explanation) – t.niese Sep 14 '14 at 16:27
  • @t.niese yes that's what I meant in the portion I just edited in. Once one of the subexpressions matches, it won't backtrack within the list. I don't have a deep understanding of *why* but I assume it's just the semantics of the meta-grammar. – Pointy Sep 14 '14 at 16:29
  • I think it is defined that way for PEG.js to avoid situations where it isn't possible anymore to determine why a certain portion of the text matches the grammar. The more flexible the parser is, the harder it will be to debug the rules. – t.niese Sep 14 '14 at 16:35
  • @t.niese yes, it makes sense for it to be reluctant to backtrack that aggressively since it's possible to be more explicit in the grammar anyway. – Pointy Sep 14 '14 at 16:36
  • This is super-helpful. I made a mistake in my original; the output should be `Return <>a_b_c</>` (no trailing `_`). I edited my OP, and you might want to edit your answer so as to not confuse future people, but your dissection of the problem helps immensely. – Domenic Sep 14 '14 at 16:39
  • In particular the `&"_") "_"` trick is great to have in my repertoire. – Domenic Sep 14 '14 at 16:40
  • I love questions about PEG.js because I have no need for it in my day-to-day life :) – Pointy Sep 14 '14 at 16:41

The parser interprets the final "_." as another "_" variableSegment pair, because "." is allowed by the variableSegment character class. If you exclude the dot from the variableSegment character class, your code will work as expected.
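
A quick way to see the effect is a hand-written JavaScript sketch of the original rule shape with a configurable set of excluded characters. This is not PEG.js itself, and makeVariable is a made-up helper. In the grammar, the corresponding change would presumably be variableSegment = $[^\n_ .]+.

```javascript
// makeVariable (hypothetical helper) builds a matcher for the original rule shape
//   "_" variableSegment ("_" variableSegment)* "_"
// where stopChars lists the characters a variableSegment may NOT contain.
function makeVariable(stopChars) {
  const seg = (s, i) => {
    let p = i;
    while (p < s.length && !stopChars.includes(s[p])) p++;
    return p > i ? p : -1;
  };
  return (s, i) => {
    if (s[i] !== "_") return -1;
    let p = seg(s, i + 1);
    if (p === -1) return -1;
    while (s[p] === "_") {
      const next = seg(s, p + 1);
      if (next === -1) break;
      p = next; // greedy, never given back
    }
    return s[p] === "_" ? p + 1 : -1; // returns end position, or -1 on failure
  };
}

const original = makeVariable("\n_ "); // [^\n_ ]
const noDot = makeVariable("\n_ .");   // [^\n_ .]

console.log(original("_a_b_c_.", 0)); // -1
console.log(noDot("_a_b_c_.", 0));    // 7
console.log(noDot("_a_b_c_,", 0));    // -1
```

The last line shows the limit of this approach: any other punctuation that can directly follow a variable (a comma here) would need to be excluded in the same way.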

fgnass