Python regular expression returning extra capture group for last character matched

Question

I am trying to create a regular expression that will take strings and break them up into three groups: (1) Any one of a specific list of words at the beginning of a string. (2) Any one of a specific list of words at the end of a string. (3) All of the letters/whitespace in between these two matches.

As an example, I will use the following two strings:

'There was a cat in the house yesterday'
'Did you see a cat in the house today'

I would like the string to be broken up into capture groups so that the match object m.groups() will return the following for each string respectively:

('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')

Originally, I came up with the following regex:

r = re.compile('^(There|Did) ( |[A-Za-z])+ (today|yesterday)$')

However this returns:

('There', 'e', 'yesterday')
('Did', 'e', 'today')

So it's only giving me the last character matched in the middle group. I learned that this doesn't work because capture groups will only return the last iteration that matched. So I put parentheses around the middle capture group as follows:

r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')

But now, although it does at least capture the middle group, it is also returning an extra "e" character in m.groups(), i.e.:

('There', 'was a cat in the house', 'e', 'yesterday')

... although I feel like this has something to do with backtracking, I can't figure out why it is happening. Could someone please explain to me why I am getting this result, and how I can get the desired results?

Thank you for actually including a couple of attempted solutions in your question. There are far too many regex questions that don't show any work, and this isn't one of them. So thank you for that. — skrrgwasme, Nov 27 '15 at 22:40
No problem - I think it's always useful for other folks who might be having a similar problem to see what *doesn't* work (and why) as well as knowing how to do it properly. Anyways, thanks for your answer below. — J. Taylor, Nov 27 '15 at 22:45

score 1 · Answer 1 · answered Nov 27 '15 at 22:35

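You can simplify your current regex, and get the correct behavior, by replacing your middle capture group with the . (dot) metacharacter, which matches any character except a newline, followed by the * (asterisk) quantifier, which repeats that match zero or more times. With no literal spaces between the groups, the surrounding spaces end up inside the middle group: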

import re

s1 = 'There was a cat in the house yesterday'
s2 = 'Did you see a cat in the house today'

x = re.compile("(There|Did)(.*)(today|yesterday)")
g1 = x.search(s1).groups()
g2 = x.search(s2).groups()

print(g1)
print(g2)

Produces this output:

('There', ' was a cat in the house ', 'yesterday')
('Did', ' you see a cat in the house ', 'today')
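One caveat, as a rough sketch: the pattern above is not anchored and the dot accepts any character, so it will also match inside strings the question would presumably want to reject. The names `loose` and `strict` and the test string are illustrative assumptions, not part of the answer.

```python
import re

# Pattern from the answer above: no anchors, and "." matches any
# character except a newline.
loose = re.compile(r"(There|Did)(.*)(today|yesterday)")

# Hypothetical stricter variant: anchored at both ends, and the middle
# group is limited to letters and spaces, as the question describes.
strict = re.compile(r"^(There|Did)([A-Za-z ]+)(today|yesterday)$")

s = "Well, There were 2 cats here yesterday, I think"

loose_groups = loose.search(s).groups()
strict_match = strict.search(s)

print(loose_groups)   # ('There', ' were 2 cats here ', 'yesterday')
print(strict_match)   # None
```

Whether the looser behavior matters depends on the input; if every string is already known to start and end with one of the listed words, both patterns return the same groups.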

score 1 · Answer 2 · answered Nov 27 '15 at 22:36

A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data.

source https://regex101.com/

And here is the regex working as expected:

^(There|Did) ([ A-Za-z]+) (today|yesterday)$
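To see where the extra "e" comes from, here is a small sketch comparing the second attempt from the question with a non-capturing variant. The inner ( |[A-Za-z]) is still a capturing group of its own, so it reports its last iteration as a separate entry; backtracking is not the cause. The variable names are illustrative.

```python
import re

s = 'There was a cat in the house yesterday'

# Second attempt from the question: the inner ( |[A-Za-z]) is still a
# capturing group, so its final iteration ('e') gets its own entry.
nested = re.compile(r'^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
nested_groups = nested.search(s).groups()

# Making the inner group non-capturing with (?:...) keeps the
# repetition but drops the extra entry.
non_capturing = re.compile(r'^(There|Did) ((?: |[A-Za-z])+) (today|yesterday)$')
non_capturing_groups = non_capturing.search(s).groups()

print(nested_groups)         # ('There', 'was a cat in the house', 'e', 'yesterday')
print(non_capturing_groups)  # ('There', 'was a cat in the house', 'yesterday')
```

The character class [ A-Za-z] shown above gives the same result as the non-capturing alternation and is simpler to read.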

score 1 · Accepted Answer · answered Nov 27 '15 at 22:37

 r = re.compile('^(There|Did) (( |[A-Za-z])+) (today|yesterday)$')
                               ^ ^        ^

The inner parentheses and the | alternation marked above are unnecessary. Take those out and include the space in the character class of your middle group:

r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
                                     ^ space

EXAMPLE:

>>> r = re.compile('^(There|Did) ([A-Za-z ]+) (today|yesterday)$')
>>> r.search('There was a cat in the house yesterday').groups()
('There', 'was a cat in the house', 'yesterday')

Also, take out the literal spaces between your capture groups in the pattern if you want the leading and trailing spaces to be part of your middle (2nd) group.
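As a minimal sketch of that last suggestion, the pattern below drops the literal spaces between the groups so the character class absorbs them. It is a variation on the regex above rather than something the answer spells out, and the `results` list is only there for illustration.

```python
import re

# Same character class as above, but with the literal spaces between
# the groups removed so the middle group captures them.
r = re.compile(r'^(There|Did)([A-Za-z ]+)(today|yesterday)$')

results = [r.search(s).groups() for s in (
    'There was a cat in the house yesterday',
    'Did you see a cat in the house today',
)]

for groups in results:
    print(groups)
# ('There', ' was a cat in the house ', 'yesterday')
# ('Did', ' you see a cat in the house ', 'today')
```

This matches the output requested in the question, with the spaces on both sides kept in the middle group.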
