Replace any character outside of the English alphabet in Python?

Question

How can I replace any character outside of the English alphabet?

For example, replacing each such character in 'abcdükl*m' with ' ' should give 'abcd kl m'.

What did you try? Which resource did you consult? Do you know about "negated character classes"? (comment, Oct 25 '12 at 01:16)

score 7 · Accepted Answer · answered Oct 25 '12 at 01:17

Use the regex [^a-zA-Z]:

import re
re.sub(r'[^a-zA-Z]', ' ', mystring)

Some info: a-z and A-Z are character ranges that match all lowercase and uppercase ASCII letters, respectively, and the caret ^ at the beginning of the character class indicates negation, i.e. "anything except these".
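As a quick check of the negated character class, here is a minimal sketch that runs the pattern on the string from the question. The variable names are only illustrative, and Python 3 (where str is Unicode) is assumed.

```python
import re

text = 'abcdükl*m'

# Replace each character outside a-z / A-Z with a single space.
spaced = re.sub(r'[^a-zA-Z]', ' ', text)

# An empty replacement string removes those characters instead.
stripped = re.sub(r'[^a-zA-Z]', '', text)

print(spaced)    # abcd kl m
print(stripped)  # abcdklm
```

Which replacement string to pass depends on whether the non-letters should be blanked out, as in the question, or dropped entirely.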

score 3 · Answer 2 · edited May 23 '17 at 12:24

Assuming you're trying to normalize text, see my link under "Comprehensive character replacement module in python for non-unicode and non-ascii for HTML".

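unicodedata has a normalize function that can gracefully degrade text for you:

import unicodedata

def gracefully_degrade_to_ascii(text):
    # NFKD decomposes accented characters; 'ignore' then drops anything that is not ASCII
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')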

Full Docs - http://docs.python.org/library/unicodedata.html

If you're trying to just strip out non-ASCII chars, the negated character set regex that others mentioned is the way to do it.
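To make the difference between the two approaches concrete, here is a small sketch assuming Python 3. The helper name degrade_to_ascii is hypothetical; it follows the normalize-then-encode idea above and decodes the result back to str.

```python
import re
import unicodedata

def degrade_to_ascii(text):
    # NFKD splits 'ü' into 'u' plus a combining diaeresis; encoding with
    # errors='ignore' then drops the combining mark and keeps the 'u'.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

text = 'abcdükl*m'

print(degrade_to_ascii(text))           # abcdukl*m
print(re.sub(r'[^a-zA-Z]', ' ', text))  # abcd kl m
```

Normalization keeps the base letter of 'ü' and leaves ASCII punctuation such as '*' alone, while the regex replaces every non-letter, so the two are not interchangeable.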

score 1 · Answer 3 · answered Oct 25 '12 at 01:16

Search for [^a-zA-Z] and replace with ' '
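If the same search-and-replace is applied to many strings, the pattern can be compiled once. The sketch below also shows a variant with a trailing +, which is an assumption beyond what this answer states: it collapses a run of non-letters into one space rather than one space per character. The constant names are hypothetical.

```python
import re

# Hypothetical names; the '+' variant is an addition, not part of the answer.
NON_LETTER = re.compile(r'[^a-zA-Z]')
NON_LETTER_RUN = re.compile(r'[^a-zA-Z]+')

text = 'abc12def'

per_char = NON_LETTER.sub(' ', text)     # one space per replaced character
per_run = NON_LETTER_RUN.sub(' ', text)  # one space per run of non-letters

print(repr(per_char))  # 'abc  def'
print(repr(per_run))   # 'abc def'
```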

pogo


John La Rooy · Answer 4 · 2012-10-25T01:28:37.833

>>> import string
>>> print(''.join(x if x in string.ascii_letters else ' ' for x in u'abcdükl*m'))
abcd kl m
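The same regex-free idea can be wrapped in a reusable function. This is a sketch assuming Python 3; the names ASCII_LETTERS and replace_non_letters are hypothetical, and the frozenset is an optional tweak so that each membership test is a hash lookup rather than a scan of the string.

```python
import string

# Hypothetical helper names; a frozenset gives constant-time membership tests.
ASCII_LETTERS = frozenset(string.ascii_letters)

def replace_non_letters(text, replacement=' '):
    # Keep ASCII letters as they are and swap every other character.
    return ''.join(ch if ch in ASCII_LETTERS else replacement for ch in text)

print(replace_non_letters('abcdükl*m'))  # abcd kl m
```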

edited Oct 25 '12 at 01:28

answered Oct 25 '12 at 01:23

