I am trying to deal with difflib matches that return two-word place names when only one of the words was used to make the match. That is, when I do the regex substitution with the difflib match, the second word ends up doubled.
Approach:
- capture two substrings so that the repeated word is the last word of the first and the first word of the second
- if the two words are the same, remove the first word of the 'everything after' substring and substitute the result back in
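The approach above could be sketched roughly as follows. This is only a minimal sketch, assuming the place-name token always starts with a comma and joins its words with underscores, as in the test string; `remove_doubled_word` is a hypothetical helper name, not something from difflib or re.

```python
import re

def remove_doubled_word(text):
    """Drop the word after the ',_Place_Name' token if it repeats the token's last word."""
    # Group 1: comma up to the next whitespace; group 2: everything after it.
    match = re.search(r'(,\S+)(.*)', text)
    if match is None:
        return text
    ref, after = match.group(1), match.group(2)
    # Assumption: words inside the token are separated by underscores.
    last_word = ref.rsplit('_', 1)[-1]
    first_word, _, rest = after.lstrip().partition(' ')
    if first_word == last_word:
        after = ' ' + rest if rest else ''
    return text[:match.start()] + ref + after

result = remove_doubled_word("The sun shines in,_Days_Bay Bay some of the time")
print(result)  # The sun shines in,_Days_Bay some of the time
```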
I don't understand the output I am getting using Python backreferences.
# removeDupeWords.py --- test to remove double words eg "The sun shines in,_Days_Bay Bay some of the time"
import re
testString = "The sun shines in,_Days_Bay Bay some of the time"
# regex to capture from the comma up to the next space of testString, e.g. ',_Days_Bay'
refRegex = r'(,\S+)'
# regex whose second group captures everything after, e.g. ' Bay some of the time'
afterRegex = r'(,\S+)(.*)'
refString = re.search(refRegex, testString).group(0)
# print(refString)
afterString = re.sub(afterRegex, r'\2', testString)
print(afterString)
The output for r'\0', r'\1' and r'\2' is as follows:
The sun shines in
The sun shines in,_Days_Bay
The sun shines in Bay some of the time
I just want ' Bay some of the time'
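For what it's worth, re.sub only replaces the part of the string that the pattern matched and leaves everything else in place, which would explain why 'The sun shines in' survives in every output. A minimal sketch of one way to get just the trailing part, assuming re.search with group(2) is acceptable in place of re.sub (`test_string` and `after_string` are illustrative names):

```python
import re

test_string = "The sun shines in,_Days_Bay Bay some of the time"

# search() returns a match object; group(2) is only the '(.*)' part,
# without the unmatched text that precedes the comma.
match = re.search(r'(,\S+)(.*)', test_string)
after_string = match.group(2) if match else ''
print(repr(after_string))  # ' Bay some of the time'
```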
The Regular Expression HOWTO in the docs doesn't go into backreferences in much detail. I couldn't find enough information to explain why I would get any output at all for r'\0'.
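As far as I can tell, a replacement template does not treat r'\0' as a group reference at all. It appears to be read as an octal escape for the NUL character, which prints invisibly, while the whole match is spelled r'\g<0>'. A small sketch to check this, using the same pattern as above (`with_zero` and `with_g0` are illustrative names):

```python
import re

test_string = "The sun shines in,_Days_Bay Bay some of the time"
pattern = r'(,\S+)(.*)'

# r'\0' is parsed as an octal escape, so the match is replaced by chr(0).
with_zero = re.sub(pattern, r'\0', test_string)
print(repr(with_zero))  # 'The sun shines in\x00'

# r'\g<0>' refers to the entire match, so the string comes back unchanged.
with_g0 = re.sub(pattern, r'\g<0>', test_string)
print(repr(with_g0))
```

Printing with repr() makes the otherwise invisible NUL character show up as \x00.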