I have a string that contains several types of personally identifiable information (PII):

text = 'Hello my name is Tom and I love Tomcat. My email address is tom@foo.bar and my phone number is (201) 5550123.'

I also have a list of PII that I want to remove from the string:

values = ['Tom', 'tom@foo.bar', '(201) 5550123']

I want to combine the values into a single regular expression and substitute them all in one go, instead of looping over the values and replacing them one at a time:

escaped_values = [r'\b' + re.escape(value) + r'\b' for value in values]
combined_pattern = '|'.join(escaped_values)
combined_regex = re.compile(combined_pattern)

The word boundaries are important because I don't want to remove "Tom" from "Tomcat"; it should only be removed when it appears by itself. This almost works, except for the phone number:

combined_regex.sub('', text)
# 'Hello my name is  and I love Tomcat. My email address is  and my phone number is (201) 5550123.'

I isolated the problem somewhat. It has to do with the combination of parens and word boundaries:

re.compile(r'\b\(201\)\ 5550123\b').sub('', 'before (201) 5550123 after')
# 'before (201) 5550123 after'

This isn't a Python issue, as can be seen here:

RegEx Pal showing PCRE mismatch

I know there are a lot of ways I could change my program, but I don't understand why this regex doesn't work and it's driving me nuts.

  • Explanation and several solutions can be found [here](https://stackoverflow.com/a/45145800/3832970). – Wiktor Stribiżew Mar 30 '21 at 18:52
  • The actual problem here involves the interaction of the `\b` and the `\(`. If you remove that initial `\b`, it works. `\b` only matches between a word character and a non-word character; the space and the `(` are both non-word characters, so there is no "word boundary" there. Wiktor's reference is a good one. – Tim Roberts Mar 30 '21 at 18:54
  • The actual problem is misunderstanding what `\b` matches. Please also see [What is a word boundary in regex](https://stackoverflow.com/questions/1324676/what-is-a-word-boundary-in-regex-does-b-match-hyphen). – Wiktor Stribiżew Mar 30 '21 at 20:05
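
A minimal sketch of what these comments describe, using the sample string from the question. The variable names are only for illustration; the point is that `\b` needs a word character on exactly one side of the position:

```python
import re

sample = 'before (201) 5550123 after'

# Between ' ' and '(' both neighbours are non-word characters,
# so a leading \b cannot match at that position.
with_leading_boundary = re.search(r'\b\(201\)\ 5550123\b', sample)

# Dropping the leading \b lets the pattern match; the trailing \b is fine
# because '3' (word character) is followed by ' ' (non-word character).
without_leading_boundary = re.search(r'\(201\)\ 5550123\b', sample)

# A \b before '(' only matches when a word character precedes the paren.
paren_after_word = re.search(r'\b\(', 'a(')

print(with_leading_boundary)             # None
print(without_leading_boundary.group())  # (201) 5550123
print(paren_after_word is not None)      # True
```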

1 Answer

You may use:

import re

text = 'Hello my name is Tom and I love Tomcat. My email address is tom@foo.bar and my phone number is (201) 5550123.'
values = ['Tom', 'tom@foo.bar', '(201) 5550123']
escaped_values = [re.escape(value) for value in values]
combined_pattern = r'(?<!\w)(?:' +'|'.join(escaped_values) + r')(?!\w)'
combined_regex = re.compile(combined_pattern)

print(combined_pattern)
print()
print(combined_regex.sub('', text))

Output:

(?<!\w)(?:Tom|tom@foo\.bar|\(201\)\ 5550123)(?!\w)

Hello my name is  and I love Tomcat. My email address is  and my phone number is .

Take note of the combined regex in use here:

(?<!\w)(?:Tom|tom@foo\.bar|\(201\)\ 5550123)(?!\w)

RegEx Demo

RegEx Explained:

  • (?<!\w): Negative lookbehind to assert that we don't have a word character before the current position
  • (?:: Start non-capture group
    • Tom|tom@foo\.bar|\(201\)\ 5550123: Match one of these substrings separated with | (alternation)
  • ): End non-capture group
  • (?!\w): Negative lookahead to assert that we don't have a word character after the current position
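
If this is needed in more than one place, the same idea could be wrapped in a small helper. This is only a sketch: `build_pii_regex` is a hypothetical name, and sorting the values longest-first is an added assumption (so that a longer value is tried before a shorter one that is a prefix of it), not something the pattern above requires for this input.

```python
import re

def build_pii_regex(values):
    """Compile one pattern that matches any value not touching a word character."""
    # Assumption: try longer values first so they win over shorter prefixes.
    ordered = sorted(values, key=len, reverse=True)
    alternation = '|'.join(re.escape(value) for value in ordered)
    # (?<!\w) and (?!\w) act like \b but also work next to punctuation such as '('.
    return re.compile(r'(?<!\w)(?:' + alternation + r')(?!\w)')

text = ('Hello my name is Tom and I love Tomcat. My email address is '
        'tom@foo.bar and my phone number is (201) 5550123.')
values = ['Tom', 'tom@foo.bar', '(201) 5550123']

redacted = build_pii_regex(values).sub('', text)
print(redacted)
```

"Tomcat" is left alone because the "Tom" inside it is followed by a word character, which the negative lookahead rejects.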