Python re.sub() is not replacing every match

Question

I'm using Python 3 and I have two strings: abbcabb and abca. I want to remove every double occurrence of a single character. For example:

abbcabb should give c and abca should give bc.

I've tried the following regex (here):

(.)(.*?)\1

But it gives the wrong output for the first string. I also tried another one (here):

(.)(.*?)*?\1

But this one gives the wrong output too. What's going wrong here?

The Python code is a single print statement:

print(re.sub(r'(.)(.*?)\1', r'\g<2>', s))  # s is the string; raw string avoids the invalid '\g' escape

Explain the logic behind the results you want. Are you saying that if there are an even number of occurrences of the character, then you don't want it at all, and if there are an odd number, you want exactly one in the output? Do you actually care about the output order, or do you just want to know which characters have an odd number of occurrences ? — Karl Knechtel, Dec 15 '18 at 08:23
What _exactly_ do you mean to "every double occurrence"? "All characters which occur more than once in the string"? "All characters with a neighbor of the same value"? — yeputons, Dec 15 '18 at 08:23
@KarlKnechtel You're right. I want just one if the repetition is odd. And, the order is optional. — vrintle, Dec 15 '18 at 08:30
So that we're clear: putting both inputs together, `abbcabbabca`, should give `b` (since the two `c`s cancel), not `cbc`? — Karl Knechtel, Dec 15 '18 at 08:37
Well, I tried to guess something else :) but if that really is the problem you're trying to solve, then regexes are really not what you want and @jon has it right. — Karl Knechtel, Dec 15 '18 at 08:41
clearly a case where you had a problem, you're using regexes and now you have 2 problems — Jean-François Fabre, Dec 15 '18 at 08:47

score 3 · Answer 1 · answered Dec 15 '18 at 08:36

This can be solved without regular expressions, for example:

>>> s, s1 = 'abbcabb', 'abca'
>>> ''.join([i for i in s1 if s1.count(i) == 1])
'bc'
>>> ''.join([i for i in s if s.count(i) == 1])
'c'

JON


This looks for unique characters; for characters with an odd count, modify the condition for the `.count` check accordingly (`s.count(i) % 2 == 1` for example). – Karl Knechtel Dec 15 '18 at 08:40
sure @KarlKnechtel, i will modify it soon Thanks for notifying :) – JON Dec 15 '18 at 08:42
works, but using count repeatedly in a list comprehension loops over all elements each time: o(n**2). Gave me the idea to answer myself – Jean-François Fabre Dec 15 '18 at 08:44

Barmar · Answer 2 · 2018-12-15T08:35:16.733

2

re.sub() doesn't perform overlapping replacements. After it replaces the first match, it starts looking after the end of the match. So when you perform the replacement on

abbcabb

it first replaces abbca with bbc. Then it replaces bb with an empty string. It doesn't go back and look for another match in bbc.
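To see this for yourself, here is a minimal sketch that lists the matches with re.finditer(), which scans for the same non-overlapping matches that re.sub() replaces. The variable names are only for illustration.

```python
import re

pattern = re.compile(r'(.)(.*?)\1')
s = 'abbcabb'

# Each entry: (matched text, span in s, contents of group 2)
matches = [(m.group(0), m.span(), m.group(2)) for m in pattern.finditer(s)]
for text, span, middle in matches:
    print(text, span, repr(middle))
# abbca (0, 5) 'bbc'
# bb (5, 7) ''

# A single pass of sub() keeps only group 2 of each match.
result = pattern.sub(r'\g<2>', s)
print(result)  # bbc
```

The second match starts at index 5, right after the first one ends, so the bbc left behind by the first replacement is never rescanned.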

If you want that, you need to write your own loop.

import re

s = 'abbcabb'
while True:
    # Repeat the substitution until a pass leaves the string unchanged.
    newS = re.sub(r'(.)(.*?)\1', r'\g<2>', s)
    if newS == s:
        break
    s = newS
print(newS)

edited Dec 15 '18 at 08:35

answered Dec 15 '18 at 08:23


On reflection, I *think* how OP wants it to work is that the entire `abbcabb` is matched by the regex: an opening `abb`, a single character, and then a closing `abb` which matches the opening pattern. – Karl Knechtel Dec 15 '18 at 08:27
@KarlKnechtel I disagree, OPs group is only one character long. That cannot match `abb` – Nick Dec 15 '18 at 08:28

Jean-François Fabre · Answer 3 · 2018-12-15T09:51:57.107

Regular expressions don't seem to be the ideal solution here:

- they don't handle overlapping matches, so a loop is needed (like in this answer), and each pass builds a new string (performance suffers)
- they're overkill here; we just need to count the characters

I like this answer, but calling count repeatedly inside a list comprehension loops over the whole string for each character.

The problem can be solved without regular expressions and in O(n) rather than O(n**2) time using collections.Counter:

- first, count the characters of the string in a single pass
- then filter the string, testing each character's count against the counter we just created

like this:

import collections

s = "abbcabb"

cnt = collections.Counter(s)

s = "".join([c for c in s if cnt[c]==1])

(as a bonus, you can change the count to keep characters which have 2, 3, whatever occurrences)
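For the parity rule the OP settled on in the comments (keep one copy of each character that occurs an odd number of times), a minimal sketch along the same lines could look like this. The function name odd_chars is hypothetical, and the output order relies on Counter preserving first-insertion order, which assumes Python 3.7+.

```python
from collections import Counter

def odd_chars(s):
    """Return each character with an odd count once, in first-seen order."""
    counts = Counter(s)  # single O(n) pass over the string
    # Counter is a dict subclass, so items() iterates in first-insertion order.
    return ''.join(c for c, n in counts.items() if n % 2 == 1)

print(odd_chars('abbcabb'))      # c
print(odd_chars('abca'))         # bc
print(odd_chars('abbcabbabca'))  # b
```

Unlike filtering the original string, this emits a character with three or five occurrences only once.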

Yes, this general approach is best in the long run - hence the hint in my answer. (Un?)fortunately on modern machines it takes fairly long strings for this to become noticeable :) — Karl Knechtel, Dec 15 '18 at 09:40

score 1 · Answer 4 · answered Dec 15 '18 at 08:36

EDIT: based on the comment exchange - if you're just concerned with the parity of the letter counts, then you don't want regex and instead want an approach like @jon's recommendation. (If you don't care about order, then a more performant approach with very long strings might use something like collections.Counter instead.)

My best guess as to what you're trying to match is: "one or more characters - call this subpattern A - followed by a different set of one or more characters - call this subpattern B - followed by subpattern A again".

You can use + as a shortcut for "one or more" (instead of specifying it once and then using * for the rest of the matches), but either way you need to get the subpatterns right. Let's try:

>>> import re
>>> pattern = re.compile(r'(.+?)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'bbcbaca'

Hmm. That didn't work. Why? Because the first group is not greedy, our "subpattern A" can just match the first a in the string - it does appear later, after all. If we use a greedy match instead, Python will backtrack until it finds the longest candidate for subpattern A that still allows the A-B-A pattern to appear:

>>> pattern = re.compile(r'(.+)(.+?)\1')
>>> pattern.sub(r'\g<2>', 'abbcabbabca')
'cbc'

Looks good to me.

Paritosh Singh · Accepted Answer · 2018-12-15T08:52:35.490

0

The site explains it well: hover over the pattern and use the explanation section.

(.)(.*?)\1 does not remove or match every double occurrence. It matches one character, followed by as few characters as possible, up to the next occurrence of that same character.

So, for abbcabb, the "sandwiched" portion is bbc, between the two a's.

EDIT: You can try something like this instead without regexes:

string = "abbcabb"
result = []
for i in string:
    if i not in result:
        result.append(i)
    else:
        result.remove(i)
print(''.join(result))

Note that this orders the output by the "last" odd occurrence of each character, not the first.

For the "first" occurrence, you should use a counter as suggested in this answer. Just change the condition to check for odd counts, e.g. count[letter] % 2 == 1.
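The toggle loop above uses a list, where both the `in` test and `remove` scan the list. As a rough sketch, the same toggling can be done with a dict, which keeps insertion order (assuming Python 3.7+) and has constant-time lookups and deletes. The name toggle_odd is made up for this example.

```python
def toggle_odd(s):
    """Keep characters seen an odd number of times by toggling membership."""
    seen = {}  # keys stay in insertion order
    for ch in s:
        if ch in seen:
            del seen[ch]     # a second occurrence cancels the first
        else:
            seen[ch] = None  # (re)inserted at the end
    return ''.join(seen)

print(toggle_odd('abbcabb'))  # c
print(toggle_odd('abcaa'))    # bca
```

The second call shows the "last occurrence" ordering: a occurs three times, so it ends up after b and c, whereas a first-occurrence ordering would give abc.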

edited Dec 15 '18 at 08:52

answered Dec 15 '18 at 08:23


Thanks! So, the regex actually skips to look again at `bbc` when it removes the `a`'s. – vrintle Dec 15 '18 at 08:27
bingo. or more precisely, the re.sub should be thought of as two steps. the regex first matches everything it can in one go on the entire string, which means `abbca` and `bb`, and only then the replacement step happens. @rv7 – Paritosh Singh Dec 15 '18 at 08:29
I'm using regex because the length of string is guaranteed to be below 50. So, I thought looping wouldn't be the proper way to handle them. – vrintle Dec 15 '18 at 08:33
Regexes effectively have to scan or loop through your text under the hood anyway. You can get much more reliable output by avoiding regexes in this setup if your goal is just to remove occurrence pairs. @rv7 – Paritosh Singh Dec 15 '18 at 08:35
