Why Does re.sub() Not Work in Python 3.6?

Question

I'm working on a project where I have to read data from an Excel spreadsheet. I'm using Python.

I noticed that when I use "re.sub()", the characters in the original string are not replaced, whereas "string.replace()" does replace them.

I'm wondering if I'm doing something wrong. Could anyone please check this on your end?

Technical Details:

Python version: 3.6.
Operating System: Windows 10
Libraries to install: openpyxl
UTF-8 codes
Unicode for emojis

This is what I originally had:

string = re.sub(u'([\u2000-\u206f])', " ", string)
string = re.sub(u'(\u00a0)', " ", string)

string = string.replace("‰", " ") #\u0089
string = string.replace("¤", " ") #\u00a4

Following "chepner"'s advice, I changed the logic to the following:

replacementDict = {}
replacementDict.update(dict.fromkeys(map(chr, range(0x2000, 0x206f)), " "))
replacementDict['\u00a0'] = " "
replacementDict['\u0089'] = " "
replacementDict['\u00a4'] = " "

string = string.translate(replacementDict)

But I'm still not able to remove the illegal characters from the string.

You can download the script and a sample test here:

Steps to reproduce the issue:

Run the script as-is (I removed the need to pass parameters to it). You will notice that the lines that did not match are the ones with illegal characters.

The code as-is doesn't do anything, and frankly it is quite massive. Please take a look at the [mre] help page and [edit] the question accordingly. — MisterMiyagi, Oct 14 '21 at 15:47
Issues aside, you are making far more calls to `re.sub` than you need. `string = re.sub('[\u2000-\u206f\u00a0\u1680\u180e\ufeff\u00ad]', ' ', string)` would be sufficient, for example. — chepner, Oct 14 '21 at 15:48
The capture groups are also unnecessary, since you aren't making references to them in the substitution text. — chepner, Oct 14 '21 at 15:49
You don't even need regular expressions for this. You can use `str.translate` to map single characters to their replacements. — chepner, Oct 14 '21 at 15:50
To reiterate: The question is closed because it has no [mre] to reproduce your problem. That includes *sample* input and expected/actual output. The most recent edits only serve to invalidate the existing answer, which is contrary to the point of [so]. If you cannot get an answer to work, *comment* on it instead. — MisterMiyagi, Oct 15 '21 at 06:12
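As a minimal sketch of the single-call approach suggested in the comments above: the character class is copied from chepner's comment, while the `UNWANTED` name and the sample string are hypothetical and only there for illustration.

```python
import re

# Character class taken from the comment above. No capture group is needed
# because the replacement text never refers to one.
UNWANTED = re.compile('[\u2000-\u206f\u00a0\u1680\u180e\ufeff\u00ad]')

# Hypothetical sample input: a no-break space, a thin space and a per-mille sign.
sample = 'price:\u00a010\u2009EUR\u2030'

# re.sub returns a new string; the original is left untouched, so the
# result has to be assigned to a name.
cleaned = UNWANTED.sub(' ', sample)
print(repr(cleaned))
```

Compiling the pattern once is optional, but it keeps the character class in a single place if the same cleanup is applied to every cell read from the spreadsheet.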

chepner · Accepted Answer · 2021-10-15T11:34:07.160

3

I would replace all this with a single call to str.translate, since you are only making single-character-to-single-character replacements.

You'll just need to define a single dict (that you can reuse for every call to str.translate) that maps each character to its replacement. Characters that stay the same do not need to be added to the mapping.

replacements = {}
replacements.update(dict.fromkeys(range(0x2000, 0x2070), " "))
replacements[0x1680] = ' '
# etc

string = string.translate(replacements)

You can also use str.maketrans to construct an appropriate translation table from a char-to-char mapping.
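A short sketch of the str.maketrans route, assuming the same code points as above. The `char_map` and `table` names and the sample string are made up for this example.

```python
# Write the mapping with ordinary one-character string keys...
char_map = {chr(cp): ' ' for cp in range(0x2000, 0x2070)}
char_map['\u00a0'] = ' '
char_map['\u1680'] = ' '

# ...and let str.maketrans convert those keys to the integer ordinals
# that str.translate actually looks up.
table = str.maketrans(char_map)

sample = 'a\u00a0b\u2003c'  # hypothetical input: no-break space and em space
result = sample.translate(table)
print(repr(result))
```

The table only needs to be built once and can then be passed to every str.translate call.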

edited Oct 15 '21 at 11:34

answered Oct 14 '21 at 15:54

chepner


Hello chepner. Thank you for taking the time to reply to my question. I tried your solution, but it still doesn't work for me. – Luis Oct 14 '21 at 20:46
The replacement map must contain the ordinals, not the characters. (I always mess that up too.) E.g. ``replacements[0x1680] = ' '``. One could also use ``str.maketrans`` on the current ``replacements`` to do that automatically. – MisterMiyagi Oct 15 '21 at 06:12
Thanks. Apparently, I've never constructed the mapping for `str.translate` by hand. – chepner Oct 15 '21 at 11:33
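The point about ordinals in the comments above can be checked with a small experiment. This is only an illustrative sketch; the sample string and the dict names are hypothetical.

```python
sample = 'a\u00a0b'  # hypothetical input containing a no-break space

by_char = {'\u00a0': ' '}    # str key: str.translate never finds it
by_ordinal = {0x00a0: ' '}   # int code point key: this is what gets looked up

unchanged = sample.translate(by_char)
fixed = sample.translate(by_ordinal)

# str.maketrans turns the character-keyed dict into an ordinal-keyed table.
fixed_too = sample.translate(str.maketrans(by_char))

print(repr(unchanged), repr(fixed), repr(fixed_too))
```

A character-keyed dict does not raise an error, it is silently ignored, which is consistent with the symptom of the string coming back unmodified.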
