How can I replace any character outside of the English alphabet?
For example, 'abcdükl*m' replaced with a ' ' would be 'abcd kl m'
Use the regex [^a-zA-Z] with re.sub, passing ' ' as the replacement:
import re
re.sub(r'[^a-zA-Z]', ' ', mystring)
Some info: a-z and A-Z are character ranges that match all the lowercase and uppercase letters, respectively, and the caret ^ at the beginning of the character class indicates negation, i.e. "anything except these".
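Here is a minimal sketch of that pattern applied to the input from the question. It assumes Python 3, and the helper name replace_non_letters is made up for illustration; only the pattern comes from the answer above.

```python
import re

# Precompiled negated character class: matches any single character
# that is NOT an ASCII letter.
NON_LETTER = re.compile(r'[^a-zA-Z]')

def replace_non_letters(text, replacement=' '):
    """Replace every character outside a-z / A-Z with `replacement`."""
    return NON_LETTER.sub(replacement, text)

sample = 'abcd\u00fckl*m'  # 'abcdükl*m', with ü written as an escape

print(replace_non_letters(sample))      # abcd kl m
print(replace_non_letters(sample, ''))  # abcdklm
```

Using [^a-zA-Z]+ instead would collapse each run of consecutive non-letters into a single replacement.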
Assuming you're trying to normalize text, see my link under "Comprehensive character replacement module in python for non-unicode and non-ascii for HTML".
The unicodedata module has a normalize function that can gracefully degrade text for you:
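import unicodedata

def gracefully_degrade_to_ascii(text):
    # NFKD splits accented characters into base letter + combining mark; encoding
    # with 'ignore' drops the non-ASCII rest (returns bytes on Python 3).
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')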
Full Docs - http://docs.python.org/library/unicodedata.html
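To show how normalization differs from the plain regex, here is a rough sketch that runs both on the question's input. It assumes Python 3, and the function name degrade_then_strip is hypothetical; the point is that NFKD keeps the base letter 'u' where the regex alone would discard 'ü' entirely.

```python
import re
import unicodedata

def degrade_then_strip(text, replacement=' '):
    """Degrade accented letters to ASCII, then replace remaining non-letters."""
    # NFKD turns 'ü' into 'u' followed by a combining diaeresis (U+0308).
    decomposed = unicodedata.normalize('NFKD', text)
    # Dropping non-ASCII code points removes the combining mark but keeps 'u'.
    ascii_text = decomposed.encode('ascii', 'ignore').decode('ascii')
    # What is left over ('*', digits, spaces, ...) is handled by the regex.
    return re.sub(r'[^a-zA-Z]', replacement, ascii_text)

sample = 'abcd\u00fckl*m'  # 'abcdükl*m', with ü written as an escape

print(degrade_then_strip(sample))  # abcdukl m
```

Whether 'ü' should become 'u' or a space depends on what the text is for, so this is one option rather than a drop-in answer to the question.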
If you're trying to just strip out non-ASCII chars, the negated character set regex that others mentioned is the way to do it.
>>> import string
>>> print(''.join(x if x in string.ascii_letters else ' ' for x in u'abcdükl*m'))
abcd kl m