4

In Python 3.4, I'm using the re library (the regex library gives the same result), and I'm getting a result I don't expect.

I have a string s = 'abc'. I would expect the following regex:

re.match(r"^(.*?)(b?)(.*?)$", s).groups()

..to match with three non-empty groups, namely:

('a', 'b', 'c')

--because the middle part of the pattern is greedy (b?). Instead, only the last group is non-empty:

('', '', 'abc')

I get the same result with both of the following:

re.match(r"^(.*?)(b?)(.*?)$", s).groups()   # overt ^ and $
re.fullmatch("(.*?)(b?)(.*?)", s).groups()  #fullmatch()

If I make the first group be a greedy match, then the result is:

('abc', '', '')

Which I guess I'd expect, because the greedy .* is consuming the entire string before the other groups get to see it.

The regex I'm trying to build is, of course, more complicated than this; otherwise I could just exclude the b from the left and right groups:

re.match(r"^([^b]*?)(b?)([^b]*?)$", s).groups()

But in my real use case, the middle group is a string several characters long, any of which might show up on their own in the left or right groups, so I can't just exclude those chars from the left or right groups.

I've looked at other questions tagged for , and none seems to answer this question, although I suspect that ctwheels' reply in python non-greedy match is behind my problem (the optionality of the first two groups prevents the regex engine from actually failing until it gets to the end of the string, and then it only has to backtrack a little ways to get a non-failing match).

Mike Maxwell
  • `(.*?)` will match up to the next thing that might be matched. ` (b?)` will match nothing, and that's good enough to terminate the lazy match before the first character. Which is nothing. – tdelaney May 13 '18 at 00:05
  • Thanks, I now understand this better, thanks to Ahmed's answer, and I have implemented something that resembles tdelaney's answer (mainly because I want to avoid lookahead, which I probably intuit less than I intuit lazy/greedy search). Which to choose for The Answer? I'm choosing Ahmed's, because while both solutions work, Ahmed explains the problem better. But thanks to both of you! – Mike Maxwell May 13 '18 at 02:02

4 Answers

2

I would expect the following regex

re.match(r"^(.*?)(b?)(.*?)$", s).groups()

to match with three non-empty groups.. because the middle part of the pattern is greedy

No, you shouldn't expect that. This behavior is exactly what the pattern asks for, for the following reason:

You specifically instructed the regex in the first group to be lazy, which means it will accept the fewest characters possible (zero in this case), because nothing else forces it to consume more. So, although the second group (b?) is greedy, it can't match the b, because the engine is still at position 0, where the next character is a. Since b? is allowed to match nothing, it succeeds with an empty match, and the third group then has to expand to the end of the string to satisfy the $.

You can confirm this by replacing your second group with (.?), which will then match the a, not the b as you might expect. Here's a demo for ^(.*?)(.?)(.*?)$.
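The same check can be run directly in Python. This is only a small sketch; the variable names are illustrative and not part of the question.

```python
import re

s = "abc"

# Original pattern: the lazy first group stops at zero characters,
# and b? then matches the empty string at position 0.
original = re.match(r"^(.*?)(b?)(.*?)$", s).groups()

# Same pattern with (.?) in the middle: the middle group takes the
# first character, because that is where the engine is standing.
with_dot = re.match(r"^(.*?)(.?)(.*?)$", s).groups()

print(original)  # ('', '', 'abc')
print(with_dot)  # ('', 'a', 'bc')
```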

Now, if the b were mandatory, you could simply change your regex to ^(.*?)(b)(.*?)$. But you want the first group to keep matching up to the b when it exists, while still allowing the b to be absent (i.e., the second group may be empty), so that change doesn't solve the problem.

The only solution that comes to mind at the moment that satisfies both conditions is to use a lookahead to determine whether the b exists. Here's an example:

^((?:.*?(?=b))|.*?)(b?)(.*?)$

The first group keeps matching characters (using the .) until it reaches the b and stops there. Otherwise (i.e., if there's no b), it falls back to matching the fewest characters possible, which is the original behavior. In other words, it guarantees that the second group is not empty as long as a b exists.
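As a rough sketch of how the lookahead pattern could be applied from Python, including a multi-character middle group, consider the helper below. The name split_around and its middle parameter are hypothetical, middle is assumed to be a valid regex fragment, and the input is assumed to be a single line (the . does not cross newlines).

```python
import re

def split_around(text, middle):
    """Split text into (left, middle, right) using the lookahead pattern.

    Hypothetical helper: `middle` is a regex fragment such as "b" or
    "foo|bar|baz"; it is wrapped in (?:...) so alternations stay scoped.
    """
    pattern = r"^((?:.*?(?=(?:%s)))|.*?)((?:%s)?)(.*?)$" % (middle, middle)
    return re.match(pattern, text).groups()

print(split_around("abc", "b"))              # ('a', 'b', 'c')
print(split_around("ac", "b"))               # ('', '', 'ac')
print(split_around("afoob", "foo|bar|baz"))  # ('a', 'foo', 'b')
```

When the middle part is absent, everything ends up in the third group, which matches the behavior of the original pattern.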

Please let me know if this doesn't meet any of the conditions you have.

  • Good job. Nonetheless, there are other options using alternations, empty captures, etc, e.g. `^(?|(.*?)(b)|(.*?)())(.*?)$` (PCRE) but the suggested pattern is fine. – wp78de May 13 '18 at 00:46
0

Since the goal is to split the string into three parts based on a pattern in the middle, you could search for that pattern and use its start and end index to split the string yourself.

import re

def combo_finder(line):
    try:
        search = re.search("(foo|bar|baz)", line)
        start, end = search.start(1), search.end(1)
        return (line[:start], line[start:end], line[end:])
    except AttributeError:
        return (line, '', '')

test = ("afoob", "abarb", "afoo", "ab")

for s in test:
    print(s, combo_finder(s))

This test run gives:

afoob ('a', 'foo', 'b')
abarb ('a', 'bar', 'b')
afoo ('a', 'foo', '')
ab ('ab', '', '')
tdelaney
0

Answering myself (although as I said in my comment, I chose Ahmed's answer as The Answer). Possibly this will help someone else. My solution resembles tdelaney's, but uses if/else instead of try/except, and gets the answer differently. Here's the code:

import re

rxRX = re.compile("^(.*)(foo|bar|baz)(.*)$")

def split_line(sLine):  # function name is illustrative
    Match = rxRX.match(sLine)
    if Match:
        return list(Match.groups())
    else:  # rxRX didn't match, so just return the input:
        return [sLine]
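
One thing worth noting about the pattern above: because the first group is a greedy (.*), it splits on the last occurrence of the middle alternatives when there is more than one, whereas a lazy (.*?) would split on the first. The sketch below illustrates the difference; the sample string is made up for illustration.

```python
import re

greedy = re.compile(r"^(.*)(foo|bar|baz)(.*)$")
lazy = re.compile(r"^(.*?)(foo|bar|baz)(.*)$")

line = "afoobbarc"  # made-up input containing two candidate middles

print(greedy.match(line).groups())  # ('afoob', 'bar', 'c')
print(lazy.match(line).groups())    # ('a', 'foo', 'bbarc')
```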
Mike Maxwell
  • I like this solution. BTW, no need for `^` at the start with `re.match` or `$` at the end because of the .*. – tdelaney May 13 '18 at 04:42
0

You've got good answers but I'm going to be more specific about this requirement:

But in my real use case, the middle group is a string several characters long, any of which might show up on their own in the left or right groups, so I can't just exclude those chars from the left or right groups.

Whatever the middle group is, you can use a pattern that consumes one character at a time, but only while the middle group does not start at the current position:

^((?:(?!GROUP2).)*)(GROUP2)((?:(?!GROUP2).)*)$

So if GROUP2 is b, the pattern becomes:

^((?:(?!b).)*)(b)((?:(?!b).)*)$

In the regex world, this is called a tempered dot.
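
As a rough sketch of how this could be used from Python, the hypothetical helper below builds the tempered pattern from a regex fragment. It assumes the middle group must occur exactly once: if it is absent, or occurs again on the right-hand side, nothing matches and the helper returns None.

```python
import re

def tempered_split(text, middle):
    """Split text around `middle` using a tempered dot.

    Hypothetical helper: `middle` is a regex fragment without its own
    capture groups, e.g. "b" or "foo|bar|baz".
    """
    # One character at a time, as long as `middle` does not start here.
    tempered = r"(?:(?!(?:%s)).)*" % middle
    pattern = r"^(%s)(%s)(%s)$" % (tempered, middle, tempered)
    m = re.match(pattern, text)
    return m.groups() if m else None

print(tempered_split("abc", "b"))              # ('a', 'b', 'c')
print(tempered_split("afoob", "foo|bar|baz"))  # ('a', 'foo', 'b')
print(tempered_split("ac", "b"))               # None
```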

revo