Numerical reference for backreference not working out in Python

Question

I was trying to deal with difflib matches that return double-word place names when only one of the words was used to make the match. That is, when I substitute the difflib match back in with a regex, the second word ends up doubled.

Approach:

Capture substrings so that the repeated word is the last word of one substring and the first word of the next.
If the two words are the same, remove the first one and substitute the 'everything after' substring.

I don't understand the output I am getting using Python backreferences.

# removeDupeWords.py    --- test to remove double words eg "The sun shines in,_Days_Bay Bay some of the time"

import re

testString = "The sun shines in,_Days_Bay Bay some of the time"

# regex to capture from the comma up to the next space of testString, e.g. ',_Days_Bay'
refRegex = r'(,\S+)'

# regex to capture everything after, e.g. ' Bay some of the time'
afterRegex = r'(,\S+)(.*)'

refString = re.search(refRegex, testString).group(0)
# print(refString)

afterString = re.sub(afterRegex, r'\2', testString)
print(afterString)

The output for r'\0', r'\1' & r'\2' is as follows:

The sun shines in
The sun shines in,_Days_Bay
The sun shines in Bay some of the time

I just want ' Bay some of the time'. The docs (Regular Expression HOWTO) don't go into backreferences in much detail, and I couldn't find enough information to explain why I would get any output at all for r'\0'.

If all you wanted was `' Bay some of the time'`, why are you performing a substitution operation in the first place? And what does difflib have to do with this? I don't think difflib even has any regex functionality. — user2357112, Dec 14 '22 at 03:58
I mentioned difflib to give some context as to how I've got here. This backreference issue is a small part of what I'm doing. Sometimes these kind of things are lost on some people. — Dave, Dec 14 '22 at 04:11
Whitespace is backslash-lowercase-s. Backslash-uppercase-S is 'not whitespace'. Nothing up to comma matches, so it is unaltered. The comma and everything up to next space goes into group 1, and remainder goes into group 2. — Chris Maurer, Dec 14 '22 at 04:49
Regex substitution returns the non-matching part of string unaltered! You did not get any match for \0. — Chris Maurer, Dec 14 '22 at 04:53
You may be right @ChrisMaurer. But my output does not back up your claim " The comma and everything up to next space goes into group 1" e.g `The sun shines in,_Days_Bay` My regex begins at the comma. See https://regex101.com/r/2Yt5FT/1 — Dave, Dec 14 '22 at 06:39

score 1 · Accepted Answer · answered Dec 14 '22 at 07:09

Let's try this again.

You are using `re.sub`, which only messes with the part of the string that actually matches your regex. So your regex divides your original string into three parts: `The sun shines in`, which does not match your regex at all and will not be replaced by anything; `,_Days_Bay`, which matches the first parenthesized group `(,\S+)` and goes into `\1`; and the rest of the string, ` Bay some of the time`, which matches the second parenthesized group `(.*)` and goes into `\2`.

So, the entire regex match is `,_Days_Bay Bay some of the time`, and all of that will be removed from the result and replaced with whatever you told it to use in parameter #2 to `re.sub`.

The part that did not match at all was `The sun shines in`, so it goes into your result string without modification.
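The three-way split described above can be checked directly. This is a minimal sketch that reuses `testString` and `afterRegex` from the question; the names `m` and `prefix` are introduced here purely for illustration.

```python
import re

testString = "The sun shines in,_Days_Bay Bay some of the time"
afterRegex = r'(,\S+)(.*)'

m = re.search(afterRegex, testString)

# Text before the match starts: re.sub never touches this part.
prefix = testString[:m.start()]

print(repr(prefix))      # 'The sun shines in'
print(repr(m.group(1)))  # ',_Days_Bay'
print(repr(m.group(2)))  # ' Bay some of the time'

# re.sub replaces only the matched span, so the prefix is kept verbatim.
print(re.sub(afterRegex, r'\1', testString))  # The sun shines in,_Days_Bay
print(re.sub(afterRegex, r'\2', testString))  # The sun shines in Bay some of the time
```

The last two lines reproduce the outputs reported in the question for r'\1' and r'\2': the unmatched prefix followed by the chosen group.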

Once again, `re.sub` only modifies the part of the string that matches your regex.
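Given that, here is one possible way to get only ` Bay some of the time`, offered as a sketch rather than the only approach. It also covers the r'\0' puzzle from the question: as far as Python's `re` replacement templates go, `\0` is read as an octal escape (a NUL character), and the whole match is written `\g<0>`. The name `wholeRegex` is hypothetical and not from the question.

```python
import re

testString = "The sun shines in,_Days_Bay Bay some of the time"
afterRegex = r'(,\S+)(.*)'

# Option 1: read the group from the match object; no substitution involved.
m = re.search(afterRegex, testString)
afterString = m.group(2) if m else ""
print(repr(afterString))  # ' Bay some of the time'

# Option 2: keep re.sub, but let the pattern cover the whole string so that
# no unmatched prefix is left over to pass through unchanged.
wholeRegex = r'.*?(,\S+)(.*)'
subResult = re.sub(wholeRegex, r'\2', testString)
print(repr(subResult))  # ' Bay some of the time'

# r'\0' inserts a NUL character, which is invisible when printed;
# r'\g<0>' inserts the entire match, giving back the original string.
nulResult = re.sub(afterRegex, r'\0', testString)
fullResult = re.sub(afterRegex, r'\g<0>', testString)
print(repr(nulResult))   # 'The sun shines in\x00'
print(repr(fullResult))  # 'The sun shines in,_Days_Bay Bay some of the time'
```

Option 1 is usually the simpler choice when the goal is extraction rather than replacement, which is in line with the point that `re.sub` is not needed here.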

Thanks. I accept that `re.sub` is not the best choice for this problem. I have used `re.findall`, which simplified things. — Dave, Dec 14 '22 at 23:18
