
I need a regex that will find the word "when" in all of these sentences and in any similar variation.

  • "This is that." When did it happen? (ending in quotes/or FN call)
  • This is that. When did it happen? (note quotes are gone)
  • This is that.  When did it happen? (notice the double space)
  • This is that. when did it happen? (notice the lowercase w)
  • This is that? When did it happen? (notice the question mark)

This code will match on the first iteration: (?<=\.\".)[a-zA-Z]*?(?=\s)

I'm mostly confused by the fact that my testing programs don't seem to let me use quantifiers or other modifiers inside the lookbehind. For example, I would like to write something like:

(?<=((\.)|(\!)|(\?))\"{0,1}\s{1,2})[a-zA-Z]*?(?=\s)

My problems with that text are:

1) It simply doesn't seem to process.

2) There doesn't seem to be any easy way to make the quantifiers inside the lookbehind lazy. In other words, even if it did process, I'm not sure how it would make sense of (?<=((\.)|(\!)|(\?))\"{0,1}\s{1,2}?)[a-zA-Z]*?(?=\s)

3) I added the extra parentheses because I find the pattern easier to read that way, but I'm not getting results with or without them, so they aren't the cause. As an aside, could they ever be a problem?
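
Point 1 can be reproduced directly, assuming the testing programs are backed by Python's re module (an assumption; the question itself does not name the engine). The sketch below only checks whether each pattern compiles. The helper name compiles is made up for illustration. It suggests that the quantifiers inside the lookbehind are what gets rejected, and that the extra parentheses from point 3 are accepted on their own.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re raises re.error at compile time for a variable-width lookbehind
        return False

# Fixed-width lookbehind from the question (always 3 characters wide).
fixed = r'(?<=\.\".)[a-zA-Z]*?(?=\s)'
# Lookbehind containing {0,1} and {1,2}: its width can vary.
variable = r'(?<=((\.)|(\!)|(\?))\"{0,1}\s{1,2})[a-zA-Z]*?(?=\s)'
# Only the extra parentheses, no quantifiers: every branch is 1 character wide.
parenthesized = r'(?<=((\.)|(\!)|(\?)))[a-zA-Z]*?(?=\s)'

for name, pattern in [('fixed', fixed),
                      ('variable', variable),
                      ('parenthesized', parenthesized)]:
    print(name, compiles(pattern))
```

Laziness (point 2) never comes into play here, because the pattern is refused before any matching happens.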

Mazdak
ideasandstuff

3 Answers


Since the re module doesn't support variable-length lookbehind, you could instead capture the string you want in a group.

(?:[.!)?])\"? {1,2}([a-zA-Z]+)(?=\s)

DEMO

>>> import re
>>> s = '''"This is that." When did it happen? (ending in quotes/or FN call)
This is that. When did it happen? (note quotes are gone)
This is that.  When did it happen? (notice the double space)
This is that. when did it happen? (notice the lowercase w)
This is that? When did it happen? (notice the question mark)'''
>>> re.findall(r'(?:[.!)?])\"? {1,2}([a-zA-Z]+)(?=\s)', s)
['When', 'When', 'When', 'when', 'When']
Avinash Raj

Since variable length lookbehind is not allowed with the re module, you can build an alternation of lookbehinds with fixed length:

import re
p = re.compile(r'(?:(?<=[.?!"]\s\s)|(?<=[.?!"]\s))[a-z]+', re.IGNORECASE)
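
As a rough check of the idea, the sketch below runs that pattern over the question's five sentences one at a time. The sentences list and the matches name are illustrative only, and the trailing parenthetical notes from the question are left out. Each sentence is searched separately because \s also matches a newline, so on one multi-line string a line ending in "?" would let the first word of the next line match too.

```python
import re

# Pattern copied from the answer: one fixed-width lookbehind per spacing variant.
p = re.compile(r'(?:(?<=[.?!"]\s\s)|(?<=[.?!"]\s))[a-z]+', re.IGNORECASE)

# The question's sentences, without the trailing notes.
sentences = [
    '"This is that." When did it happen?',
    'This is that. When did it happen?',
    'This is that.  When did it happen?',   # double space
    'This is that. when did it happen?',    # lowercase w
    'This is that? When did it happen?',    # question mark
]

# Search each sentence on its own and flatten the results into one list.
matches = [m for sentence in sentences for m in p.findall(sentence)]
print(matches)
```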
Casimir et Hippolyte

Just because you can write complicated, inflexible RegExes does not mean that you have to ;-)

Use \w to match a word character and \s* to match any amount of whitespace (including none).

Apart from also matching the first word after the "opening" double quotes, this should get you started: (?:[.!?"]\s*)(\w+)

I'm sure the quote thing can also be fixed.
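
One possible way to handle the quote issue, offered only as a sketch and not as the fix this answer had in mind, is to require at least one whitespace character (\s+ instead of \s*). The opening quote is followed directly by a letter, so it no longer qualifies. The names loose, strict and sentences are illustrative.

```python
import re

loose = re.compile(r'(?:[.!?"]\s*)(\w+)')   # pattern from the answer
strict = re.compile(r'[.!?"]\s+(\w+)')      # assumed variation: whitespace is required

sentences = [
    '"This is that." When did it happen?',
    'This is that.  When did it happen?',
    'This is that. when did it happen?',
]

for sentence in sentences:
    # findall returns the contents of the single capture group
    print(loose.findall(sentence), strict.findall(sentence))
```

With \s* the first sentence yields both "This" and "When"; with \s+ only the word after the sentence break is captured.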

mkm13