I am using the pegjs parser generator for a project, and I am having difficulty writing a grammar rule that matches all words up to, but not including, any word from a set of break words. For example, in the string "the door is yellow" I want to match all words up to "is", and then have the pegjs parser continue parsing from the word "is". The words I want the parser to break on are "is", "has", and "of".

My current grammar rule is as follows:

subject "sub" = 
s:[a-zA-Z ]+ { return s.join("").trim()}

How can I create a lookahead that stops the parser from including any of these words in the match? I was thinking of something like:

(!of|is|has)
Tyler Oliver

2 Answers

1

I know this question was asked 5 years ago, but I'm just running through cleaning up unanswered questions in the [pegjs] tag.

The grammar below seems to work; you just need to replace postfix with your own rule for further processing.

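subject "sub" = prefix:prefix breakWord:breakWord postfix:postfix "\n"? {
  return { prefix, breakWord, postfix }
}

// Consume one character at a time, as long as no break word starts here.
prefix = $(!breakWord .)* { return text().trim() }
// Everything after the break word, up to the end of the line.
postfix = [^\n]* { return text().trim() }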

breakWord
  = "is"
  / "has"
  / "of"

With an input of "the door is yellow", this produces:

{ prefix: "the door", breakWord: "is", postfix: "yellow" }

Note a couple of things:

  • The form (!breakWord .) is a little slow: for each character in the prefix, it looks ahead to make sure the input at that position doesn't begin with any of the alternatives in the breakWord rule.
  • If you have break words that start with a common set of characters (e.g. "is" and "isn't"), make sure the longer word is first in the breakWord rule.
  • The current postfix rule assumes that a newline might terminate the input.
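
To make the per-character lookahead and the ordering of alternatives concrete, here is a rough plain-JavaScript sketch of what the grammar does. It is an illustration only, not what pegjs generates; the names BREAK_WORDS, matchBreakWord and splitOnBreakWord are made up for this example, and newline handling is left out.

```javascript
// Hypothetical names; this only mirrors the grammar's logic by hand.
// Ordered choice: the first alternative that matches wins, so a longer word
// sharing a prefix with a shorter one must be listed first.
const BREAK_WORDS = ["is", "has", "of"];

// Rough equivalent of the breakWord rule: which break word starts at `pos`?
function matchBreakWord(input, pos) {
  for (const word of BREAK_WORDS) {
    if (input.startsWith(word, pos)) return word;
  }
  return null;
}

// Rough equivalent of: prefix:(!breakWord .)* breakWord postfix
function splitOnBreakWord(input) {
  let pos = 0;
  // (!breakWord .)* : advance one character at a time while no break word
  // starts at the current position (this is the repeated lookahead).
  while (pos < input.length && matchBreakWord(input, pos) === null) pos++;
  const breakWord = matchBreakWord(input, pos);
  if (breakWord === null) return null; // the grammar would fail to parse here
  return {
    prefix: input.slice(0, pos).trim(),
    breakWord,
    postfix: input.slice(pos + breakWord.length).trim(),
  };
}

console.log(splitOnBreakWord("the door is yellow"));
// { prefix: 'the door', breakWord: 'is', postfix: 'yellow' }
```

One thing the sketch makes visible: the check is purely textual, so a break word inside a longer word also stops the prefix. For "this door is yellow" the prefix is "th", because "is" starts at the third character. If that matters for your input, the breakWord rule probably needs some kind of word-boundary check as well.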
Joe Hildebrand
-1

This regular expression will work:

.+(?=\s+(of|is|has))

It matches one or more of any character (except line breaks) up to, but not including, white space followed by 'of', 'is', or 'has' (via a positive lookahead). Because .+ is greedy, the match extends to the last such break word on the line.
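
Note that this is a regular expression rather than a pegjs rule, so the sketch below assumes it is applied with JavaScript's RegExp outside the grammar. The lazy variant and the helper name textBeforeBreakWord are additions for illustration and are not part of the answer above.

```javascript
// The pattern from the answer: greedy `.+` backtracks from the end of the
// line, so the match ends before the LAST break word.
const greedyPattern = /.+(?=\s+(of|is|has))/;

// Illustrative variant (an assumption, not from the answer): lazy `.+?`
// ends the match before the FIRST break word instead.
const lazyPattern = /.+?(?=\s+(of|is|has))/;

// Hypothetical helper: returns the matched text, or null if nothing matches.
function textBeforeBreakWord(pattern, input) {
  const match = pattern.exec(input);
  return match === null ? null : match[0];
}

console.log(textBeforeBreakWord(greedyPattern, "the door is yellow"));
// the door
console.log(textBeforeBreakWord(greedyPattern, "the color of the door is yellow"));
// the color of the door
console.log(textBeforeBreakWord(lazyPattern, "the color of the door is yellow"));
// the color
```

The two patterns agree when the line contains a single break word and differ only when there are several.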

RedLaser