I'm working on a project where I have to read data from an Excel spreadsheet. I'm using Python.
I noticed that when I use "re.sub()", the unwanted characters in the string are not replaced. When I use "string.replace()" on the same string, they are replaced as expected.
I'm wondering if I'm doing something wrong. Could anyone please try to reproduce this?
Technical Details:
- Python version: 3.6
- Operating System: Windows 10
- Libraries to install: openpyxl
- UTF-8 codes
- Unicode for emojis
This is what I originally had:
import re

string = re.sub(u'([\u2000-\u206f])', " ", string)
string = re.sub(u'(\u00a0)', " ", string)
string = string.replace("‰", " ")  # \u0089
string = string.replace("¤", " ")  # \u00a4
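To check which code points are actually involved, here is a minimal sketch that prints the code point of each literal character as it appears in this post. The names chars, code_points and in_punctuation_block are only for illustration and are not part of the original script. If the script file is saved in a different encoding, the literals may decode differently.

```python
# Sketch: inspect the code points of the literal characters used above.
# "\u2030" is the literal "‰" and "\u00a4" is the literal "¤" as they appear in this post.
chars = ["\u2030", "\u00a4", "\u00a0"]

# Map each character to its code point written as U+XXXX.
code_points = {ch: "U+%04X" % ord(ch) for ch in chars}

# True when a character lies in the U+2000..U+206F range targeted by the first re.sub().
in_punctuation_block = {ch: 0x2000 <= ord(ch) <= 0x206F for ch in chars}

for ch in chars:
    print(repr(ch), code_points[ch], in_punctuation_block[ch])
```

With the literals as pasted here, "‰" is reported as U+2030 rather than U+0089, so it already falls inside the range handled by the first re.sub() call.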
Following "chepner"'s advice, I changed the logic to the following:
replacementDict = {}
replacementDict.update(dict.fromkeys(map(chr, range(0x2000, 0x206f)), " "))
replacementDict['\u00a0'] = " "
replacementDict['\u0089'] = " "
replacementDict['\u00a4'] = " "
string = string.translate(replacementDict)
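One thing that may matter here: str.translate() looks up each character by its integer code point, so a dictionary keyed by one-character strings is never matched. Below is a minimal sketch of the same mapping passed through str.maketrans(), which converts the character keys to integers. The names replacement_dict, TABLE and clean are hypothetical, and the range end is 0x2070 on the assumption that U+206F should be included.

```python
# Same mapping as above, keyed by single characters.
# range() excludes its end value, so 0x2070 is used to cover U+2000..U+206F.
replacement_dict = dict.fromkeys(map(chr, range(0x2000, 0x2070)), " ")
replacement_dict["\u00a0"] = " "
replacement_dict["\u0089"] = " "
replacement_dict["\u00a4"] = " "

# str.maketrans() converts the one-character keys to integer code points,
# which is the key type str.translate() looks up.
TABLE = str.maketrans(replacement_dict)


def clean(text):
    """Return text with every mapped character replaced by a space."""
    return text.translate(TABLE)


sample = "a\u2030b\u00a0c\u00a4d"
print(clean(sample))  # a b c d
# The character-keyed dict leaves the string unchanged when passed directly.
print(sample.translate(replacement_dict) == sample)  # True
```

Building the dictionary with integer keys in the first place, for example dict.fromkeys(range(0x2000, 0x2070), " "), would be an equivalent alternative.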
But I'm still not able to remove the illegal characters from the string.
You can download the script and a sample test here:
Steps to reproduce the issue:
- Run the script as-is (it no longer needs any parameters). The lines reported as not matching are the ones that contain illegal characters.