I want to write an expression grammar that matches strings like these:

words at the start ONE|ANOTHER wordAtTheEnd

---------^-------- ----^----- --^--
     A: alphas     B: choice  C: alphas

The issue, however, is that part A can contain the keywords "ONE" or "ANOTHER" from part B, so only the last occurrence of a choice keyword should match part B. Here is an example. The string

ZERO ONE or TWO are numbers ANOTHER letsendhere

should be parsed into the fields

A: "ZERO ONE or TWO are numbers"
B: "ANOTHER"
C: "letsendhere"

With pyparsing I tried the "stopOn" keyword argument of the OneOrMore expression:

import pyparsing as pp

choice = pp.Or([pp.Keyword("ONE"), pp.Keyword("ANOTHER")])('B')
start = pp.OneOrMore(pp.Word(pp.alphas), stopOn=choice)('A')
end = pp.Word(pp.alphas)('C')
expr = (start + choice) + end

But this does not work. For the sample string I get the ParseException:

Expected end of text (at char 12), (line:1, col:13)
"ZERO ONE or >!<TWO are numbers ANOTHER text"

This makes sense, because stopOn stops at the first occurrence of choice, not the last one. How can I write a grammar that stops at the last occurrence instead? Maybe I need to resort to a context-sensitive grammar?

halloleo
  • Problems with your grammar: `pp.Or(pp.Keyword("ONE"), pp.Keyword("OTHER"))` - Keyword("OTHER") will not match the "OTHER" in "ANOTHER", and Or takes a list of expressions, not 2 expressions. – PaulMcG Dec 19 '16 at 05:11
  • Yes, of course! That just slipped in when I generated the sample. Fixed in the question. Thanks. – halloleo Dec 19 '16 at 05:13

1 Answer

Sometimes you have to try to "be the parser". What is it about the "last occurrence of X" that distinguishes it from the other X's? One way to say this is "an X that is not followed by any more X's". With pyparsing, you could write a helper method like this:

from pyparsing import FollowedBy, SkipTo

def last_occurrence_of(expr):
    return expr + ~FollowedBy(SkipTo(expr))

Here it is in use as a stopOn argument to OneOrMore:

from pyparsing import OneOrMore, Word, alphas, nums

integer = Word(nums)
word = Word(alphas)
list_of_words_and_ints = OneOrMore(integer | word, stopOn=last_occurrence_of(integer)) + integer

print(list_of_words_and_ints.parseString("sldkfj 123 sdlkjff 123 lklj lkj 2344 234 lkj lkjj"))

prints:

['sldkfj', '123', 'sdlkjff', '123', 'lklj', 'lkj', '2344', '234']
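
Applied to the grammar from the question, the same helper could be used roughly as sketched below. This is only a sketch under a few assumptions: the choice keywords are taken to be "ONE" and "ANOTHER" as in the question's prose, the parse action that joins part A into one string is my guess at the desired output, and `parse_fields` is a hypothetical wrapper name, not something pyparsing provides.

```python
import pyparsing as pp


def last_occurrence_of(expr):
    # Helper from the answer: expr, but only if no further expr follows later.
    return expr + ~pp.FollowedBy(pp.SkipTo(expr))


word = pp.Word(pp.alphas)
choice = pp.Keyword("ONE") | pp.Keyword("ANOTHER")

# Part A: collect words until the last choice keyword, then join them
# into one string (the join is an assumption about the desired output).
start = pp.OneOrMore(word, stopOn=last_occurrence_of(choice))
start = start.setParseAction(lambda tokens: " ".join(tokens))

expr = start("A") + choice("B") + word("C")


def parse_fields(text):
    # Hypothetical convenience wrapper, not part of pyparsing.
    result = expr.parseString(text, parseAll=True)
    return {"A": result["A"], "B": result["B"], "C": result["C"]}


print(parse_fields("ZERO ONE or TWO are numbers ANOTHER letsendhere"))
```

For the sample string this should yield the three fields listed in the question. The unnamed `choice` is passed to `stopOn`, and the results name "B" is attached only where the keyword is actually consumed.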
PaulMcG
  • Works with `Word` or `Keyword` expressions in the `OneOrMore(Or(...` part exactly as expected. Cool. – halloleo Dec 19 '16 at 06:21
  • I did, however, have trouble with `CaselessKeyword` expressions: a `ParseException` was raised. I solved this by using the "weaker" `Keyword(.., caseless=True)` instead. – halloleo Dec 19 '16 at 06:24
  • And one question: Why do you call `SkipTo`? - Works for me just with `FollowedBy(expr)` too. Performance? – halloleo Dec 19 '16 at 06:33
  • `FollowedBy` does not look all the way through the rest of the string, just at the next immediate parse position. What you are describing is more like "last consecutive occurrence". In my example, this would stop at '123' in my test string, not go all the way to '234'. I don't see how `FollowedBy(expr)` works for you. – PaulMcG Dec 19 '16 at 06:58
  • Well, I had only two occurrences of the choice keyword in my sample strings, that's why I didn't detect the difference. %-) – halloleo Dec 19 '16 at 07:16
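
The difference discussed in these comments can be checked with a small side-by-side sketch. The helper name `last_consecutive_occurrence_of` and the wrapper `parse_with` are made up for this illustration; the test string is the one from the answer.

```python
from pyparsing import FollowedBy, OneOrMore, SkipTo, Word, alphas, nums

integer = Word(nums)
word = Word(alphas)


def last_consecutive_occurrence_of(expr):
    # Only looks at the next parse position: expr not immediately
    # followed by another expr.
    return expr + ~FollowedBy(expr)


def last_occurrence_of(expr):
    # Looks through the rest of the input: expr with no further expr
    # anywhere after it.
    return expr + ~FollowedBy(SkipTo(expr))


def parse_with(stop_helper, text):
    # Hypothetical wrapper: same grammar as in the answer, with the
    # stopOn helper swapped in.
    grammar = OneOrMore(integer | word, stopOn=stop_helper(integer)) + integer
    return grammar.parseString(text).asList()


text = "sldkfj 123 sdlkjff 123 lklj lkj 2344 234 lkj lkjj"
print(parse_with(last_consecutive_occurrence_of, text))
print(parse_with(last_occurrence_of, text))
```

With the `FollowedBy(expr)` variant the repetition should already stop at the first '123', since the next token is a word; only the `SkipTo` variant runs on to '234'.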