Does the Peg.js engine backstep after a lookahead like regexes do?

Question

According to regular-expressions.info on lookarounds, the engine backsteps after a lookahead:

Let's take one more look inside, to make sure you understand the implications of the lookahead. Let's apply q(?=u)i to quit. The lookahead is now positive and is followed by another token. Again, q matches q and u matches u. Again, the match from the lookahead must be discarded, so the engine steps back from i in the string to u. The lookahead was successful, so the engine continues with i. But i cannot match u. So this match attempt fails. All remaining attempts fail as well, because there are no more q's in the string.
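The behaviour described in the quote can be checked with any regex engine that supports lookahead. Below is a minimal sketch using Python's standard `re` module; Python is used only for illustration and is not part of the original question.

```python
import re

# (?=u) checks that "u" comes next but consumes nothing,
# so the token after the lookahead is tried against "u" as well.
no_match = re.search(r"q(?=u)i", "quit")
match = re.search(r"q(?=u)uit", "quit")

print(no_match)        # None: "i" is tried against "u" and fails
print(match.group(0))  # quit
```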

However, in Peg.js it seems like the engine still moves past the & or !, so that it isn't a lookahead in the same sense as in regexps but a decision on consumption: there is no backstepping, and therefore no true looking ahead.

Is this the case?

(If so, then certain parses aren't even possible, like this one?)

Josh Voigts · Accepted Answer · 2018-10-25T14:03:02.467

Lookahead works similarly to how it does in a regex engine.

This query fails to match "quit": after 'q', the lookahead &'u' succeeds without consuming anything, so the next literal has to match 'u', not 'i'.

word = 'q' &'u' 'i' 't'

This query succeeds:

word = 'q' &'u' 'u' 'i' 't'

This query succeeds:

word = 'q' 'u' 'i' 't'
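To make the semantics behind these three rules concrete, here is a small hand-rolled model of PEG matching in Python. It is not Peg.js's implementation; the helper names (`lit`, `and_pred`, `seq`, `matches`) are made up for this sketch, which only assumes the usual PEG rule that `&` succeeds or fails without moving the input position.

```python
# Each parser takes (text, pos) and returns the new position, or None on failure.

def lit(s):
    """Match the literal string s and consume it."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def and_pred(p):
    """PEG &p: succeed if p matches here, but leave the position unchanged."""
    def parse(text, pos):
        return pos if p(text, pos) is not None else None
    return parse

def seq(*parsers):
    """Match each parser in order, threading the position through."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def matches(parser, text):
    """True if the parser consumes the whole input."""
    return parser(text, 0) == len(text)

# word = 'q' &'u' 'i' 't'
fails = seq(lit("q"), and_pred(lit("u")), lit("i"), lit("t"))
# word = 'q' &'u' 'u' 'i' 't'
with_lookahead = seq(lit("q"), and_pred(lit("u")), lit("u"), lit("i"), lit("t"))
# word = 'q' 'u' 'i' 't'
plain = seq(lit("q"), lit("u"), lit("i"), lit("t"))

print(matches(fails, "quit"))           # False
print(matches(with_lookahead, "quit"))  # True
print(matches(plain, "quit"))           # True
```

In this model the predicate returns the same position it was given, which is the "stepping back" the regex tutorial describes: the 'u' is still there for the next token to match.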

As for your example, try something along these lines; you shouldn't need to use lookaheads at all:

expression
    = termPair ( _ delimiter _ termPair )*

termPair
    = term ('.' term)? ' ' term ('.' term)?

term "term"
    = $([a-z0-9]+)

delimiter "delimiter"
    = "."

_ "whitespace"
    = [ \t\n\r]+

EDIT: Added another example per comments below.

expression
    = first:term rest:delimTerm* { return [first].concat(rest); }

delimTerm
    = delimiter t:term { return t; }

term "term"
    = $((!delimiter [a-z0-9. ])+)

delimiter "delimiter"
    = _ "." _

_ "whitespace"
    = [ \t\n\r]+

EDIT: Added extra explanation of the term expression.

I'll try to break down the term rule a bit: $((!delimiter [a-z0-9. ])+).

$() returns everything matched inside it as a single string, much like [].join('') would.

A single "character" of a term is any character in [a-z0-9. ]; if we wanted to simplify it, we could use . (any character) instead. Before matching each character, we look ahead for a delimiter; if we find one, the term stops there, and the lookahead itself consumes nothing. Since we want multiple characters, we repeat the whole thing with +.

I think it's a common idiom in PEG parsers to move forward this way. I learned the idea from the Treetop documentation for matching a string.
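The same "check for the delimiter, then take one character" loop can be sketched outside of Peg.js. The Python below is only an illustration of the idiom, not generated parser code; `parse_term` and `parse_expression` are hypothetical names, and the two regular expressions stand in for the delimiter rule and the [a-z0-9. ] character class.

```python
import re

DELIMITER = re.compile(r"[ \t\n\r]+\.[ \t\n\r]+")  # delimiter = _ "." _
TERM_CHAR = re.compile(r"[a-z0-9. ]")              # one "character" of a term

def parse_term(text, pos):
    """term = $((!delimiter [a-z0-9. ])+): return (term, new_pos) or None."""
    start = pos
    while pos < len(text):
        if DELIMITER.match(text, pos):      # !delimiter: stop, consume nothing
            break
        if not TERM_CHAR.match(text, pos):  # character class does not match
            break
        pos += 1
    return (text[start:pos], pos) if pos > start else None

def parse_expression(text):
    """expression = term (delimiter term)*; the whole input must be consumed."""
    result = parse_term(text, 0)
    if result is None:
        return None
    terms, pos = [result[0]], result[1]
    while True:
        d = DELIMITER.match(text, pos)
        if d is None:
            break
        nxt = parse_term(text, d.end())
        if nxt is None:
            break
        terms.append(nxt[0])
        pos = nxt[1]
    return terms if pos == len(text) else None

print(parse_expression("term1.a term1.b . term2.a term2.b"))
# ['term1.a term1.b', 'term2.a term2.b']
```

Dots and spaces inside a term are accepted one at a time; only the full space-dot-space sequence makes the lookahead stop the term.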

edited Oct 25 '18 at 14:03

answered Oct 19 '18 at 19:06

Josh Voigts


The crux, however, is having the delimiting value includable in the term. So, if the delimiter is " . " (space dot space), can you also have terms like "term1.a term1.b"? An example expression would be "term1.a term1.b . term2.a term2.b". For a human this is readily parsable, since the space-surrounded dot is the clear delimiter, but whether Peg.js can do it is the question. – TheRealWinnebagoMan Oct 22 '18 at 01:26
Both of those examples parse in the answer above. I'm not sure what you are asking about specifically. – Josh Voigts Oct 22 '18 at 14:33
Sorry for the lack of clarity. It's not dynamic though. The term should be any combination of any number of letters, dots, and spaces, followed by the space-dot-space delimiter and another term like that. The spaces in the terms of your example are not dynamic. So the following should also be allowable: "a.1 b cde.1 cde2 . this entire sentence including its ending period is a term too." There are two terms and one delimiter after 'cde2'. Does that make more sense? – TheRealWinnebagoMan Oct 23 '18 at 20:00
I took another stab at it in the **EDIT** above. You can use a negative lookahead for `delimiter` within your `term` to avoid being overly greedy when grabbing periods. – Josh Voigts Oct 24 '18 at 14:02
Wow. I tried to do this 1000 times. Thanks for proving that it was me, not the parser and for giving me something to study. Incredible! – TheRealWinnebagoMan Oct 24 '18 at 23:12
Can you add a walk-through of the term rule? EDIT: how did you come up with this? – TheRealWinnebagoMan Oct 24 '18 at 23:15
Hope that helps, I added a little more info. – Josh Voigts Oct 25 '18 at 14:03
