I am trying to create a regular expression that will take strings and break them up into three groups: (1) any one of a specific list of words at the beginning of the string, (2) any one of a specific list of words at the end of the string, and (3) all of the letters/whitespace in between these two matches.
As an example, I will use the following two strings:
'There was a cat in the house yesterday'
'Did you see a cat in the house today'
I would like each string to be broken up into capture groups so that m.groups() on the match object returns the following for each string, respectively:
('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')
Originally, I came up with the following regex:
r = re.compile('^(There|Did) ( |[A-Za-z])+ (today|yesterday)$')
However, this returns:
('There', 'e', 'yesterday')
('Did', 'e', 'today')
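For reference, here is a minimal script that reproduces this output. The names `sentences` and `results` are just ones I picked for this example; the pattern and the two strings are exactly the ones above.

```python
import re

# Same pattern as above: the middle group sits directly under the + quantifier.
r = re.compile('^(There|Did) ( |[A-Za-z])+ (today|yesterday)$')

sentences = [
    'There was a cat in the house yesterday',
    'Did you see a cat in the house today',
]

# Both strings match, so .match() does not return None here.
results = [r.match(s).groups() for s in sentences]

for groups in results:
    print(groups)
```

Both strings do match as a whole; it is only the content of the middle group that is not what I want.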
So it's only giving me the last character matched by the middle group. I learned that this doesn't work because a repeated capture group only keeps the last iteration that matched. So I wrapped the repeated middle group in another set of parentheses, as follows:
r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
But now, although it does at least capture the middle group, it is also returning an extra "e" character in m.groups(), i.e.:
('There', 'was a cat in the house', 'e', 'yesterday')
I feel like this has something to do with backtracking, but I can't figure out why it is happening. Could someone please explain why I am getting this result, and how I can get the desired result?
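In case it helps, here is a minimal script reproducing the second result. It also prints each group by its number, so it is easier to see which set of parentheses each value comes from. As far as I understand, groups are numbered by the position of their opening parenthesis, left to right.

```python
import re

# Second attempt: an extra pair of parentheses around the repeated group.
r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
m = r.match('There was a cat in the house yesterday')

print(m.groups())

# r.groups is the number of capture groups in the compiled pattern.
for i in range(1, r.groups + 1):
    print(i, repr(m.group(i)))
```

The pattern has four capture groups in total, and the stray "e" is the third one.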