I have a string that contains several types of personally identifiable information (PII):
text = 'Hello my name is Tom and I love Tomcat. My email address is tom@foo.bar and my phone number is (201) 5550123.'
I also have a list of PII that I want to remove from the string:
values = ['Tom', 'tom@foo.bar', '(201) 5550123']
I want to combine the values into a single regular expression and substitute them all in one go, instead of looping over the values and replacing them one at a time:
import re
# Escape each value and wrap it in word boundaries
escaped_values = [r'\b' + re.escape(value) + r'\b' for value in values]
combined_pattern = '|'.join(escaped_values)
combined_regex = re.compile(combined_pattern)
The word boundaries are important because I don't want to remove "Tom" from "Tomcat", only where it appears as a standalone word. This almost works, except for the phone number:
combined_regex.sub('', text)
# 'Hello my name is  and I love Tomcat. My email address is  and my phone number is (201) 5550123.'
I isolated the problem somewhat. It has to do with the combination of parens and word boundaries:
re.compile(r'\b\(201\)\ 5550123\b').sub('', 'before (201) 5550123 after')
# 'before (201) 5550123 after'
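To narrow this down further, here is a minimal sketch that lists every position where `\b` matches in the sample string. It relies on the `re` documentation's definition of `\b` as the boundary between a `\w` and a `\W` character (or between `\w` and the start or end of the string). The names `sample`, `boundaries`, and `paren_index` are only for illustration.

```python
import re

sample = 'before (201) 5550123 after'

# Collect every index at which the zero-width \b assertion matches.
boundaries = [m.start() for m in re.finditer(r'\b', sample)]

# Index of the opening parenthesis. It sits between a space and "(",
# which are both non-word characters.
paren_index = sample.index('(')
print(paren_index in boundaries)      # is there a boundary right before "("?
print(paren_index + 1 in boundaries)  # is there one between "(" and "2"?

# Same phone pattern with and without the leading \b.
with_leading_b = re.search(r'\b\(201\) 5550123\b', sample)
without_leading_b = re.search(r'\(201\) 5550123\b', sample)
print(with_leading_b)
print(without_leading_b)
```

If that reading of `\b` is right, the position just before `(` should not show up as a boundary, while the position between `(` and `2` should, which would be consistent with the leading `\b` being the part that fails.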
This isn't a Python issue, as can be seen here:
I know there are a lot of ways I could change my program, but I don't understand why this regex doesn't work and it's driving me nuts.
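For reference, one of the changes I could make is sketched below. It swaps `\b` for lookarounds that only require "no word character next to the value", which is a different technique from the word-boundary approach above, not a fix for it. The helper name `build_pattern` is hypothetical.

```python
import re

def build_pattern(values):
    # (?<!\w) = not preceded by a word character,
    # (?!\w)  = not followed by a word character.
    # Unlike \b, neither assertion needs a word character inside the value.
    parts = [r'(?<!\w)' + re.escape(value) + r'(?!\w)' for value in values]
    return re.compile('|'.join(parts))

text = ('Hello my name is Tom and I love Tomcat. My email address is '
        'tom@foo.bar and my phone number is (201) 5550123.')
values = ['Tom', 'tom@foo.bar', '(201) 5550123']

redacted = build_pattern(values).sub('', text)
print(redacted)
```

This still leaves "Tomcat" alone, because the `c` after "Tom" is a word character, but it does not explain why the `\b` version behaves the way it does.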