Removing everything except letters and spaces from string in Python3.3

Question

I have this example string: happy t00 go 129.129 and I want to keep only the spaces and letters. All I have been able to come up with so far that is pretty efficient is:

print(re.sub("\d", "", 'happy t00 go 129.129'.replace('.', '')))

but it is specific to my example string. How can I remove all characters other than letters and spaces?

None of the answers handles letters beyond the 26 ASCII ones, e.g. ß, Ä, Ö, Ü, Ą, Ż, etc. Perhaps the question should mention only ASCII letters? – Katarzyna, Sep 21 '20 at 15:52

score 29 · Accepted Answer · edited Aug 31 '18 at 17:19

whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
myStr = "happy t00 go 129.129$%^&*("
answer = ''.join(filter(whitelist.__contains__, myStr))

Output:

>>> answer
'happy t go '

answered Feb 04 '14 at 22:15 by inspectorG4dget · edited Aug 31 '18 at 17:19 by Kevin Rickard

After testing I found this to be 0.0029 usec faster than Joel's answer when run as a `python -m timeit -n 100 -s` loop for each code in Command Prompt. – Gronk Feb 04 '14 at 23:39

@Gronk:

>>> Timer('"".join(filter(whitelist.__contains__, myStr))', '''
... whitelist = set('abcdefghijklmnopqrstuvwxy ABCDEFGHIJKLMNOPQRSTUVWXYZ')
... myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)
0.02490997314453125
>>> Timer('re.sub(r"[^a-zA-Z ]+", "", myStr)', '''import re
... myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)
0.011039972305297852

My point is that 0.0029 usec is definitely within the normal variation for a sample size of 100. – Joel Cornett Feb 05 '14 at 01:09

This also filters accented alphabet characters which might be a problem. – bp. Sep 29 '17 at 04:10
The lowercase character "z" is missing – Daniel Marschall May 19 '18 at 16:32
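
On the accented-letter concern raised in the comments: one possible approach, not taken from any of the answers here, is to test each character with `str.isalpha()` instead of a fixed whitelist. This is only a sketch; the function name `keep_letters_and_spaces` is made up for illustration. Note that `isalpha()` accepts letters from every script, not just Latin ones, so it is broader than "ASCII plus a few accents".

```python
def keep_letters_and_spaces(text):
    """Keep Unicode letters (as judged by str.isalpha) and plain spaces."""
    # Hypothetical helper: anything that is neither a letter nor ' ' is dropped.
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')


print(repr(keep_letters_and_spaces('happy t00 go 129.129$%^&*(')))
print(repr(keep_letters_and_spaces('Ä1 ß2')))
```

If only ASCII letters should survive, the whitelist approach above remains the simpler choice.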

score 18 · Answer 2 · answered Feb 04 '14 at 22:15

Use a set complement (a negated character class):

import re
re.sub(r'[^a-zA-Z ]+', '', 'happy t00 go 129.129')
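
For repeated use, the same pattern can be compiled once and wrapped in a small function. This is a minimal sketch; the names `NON_LETTER_OR_SPACE` and `strip_non_letters` are illustrative and not part of the answer.

```python
import re

# [^a-zA-Z ] matches any single character that is NOT an ASCII letter or a space;
# the trailing + lets one substitution remove a whole run of such characters.
NON_LETTER_OR_SPACE = re.compile(r'[^a-zA-Z ]+')


def strip_non_letters(text):
    """Remove everything except ASCII letters and spaces (hypothetical helper)."""
    return NON_LETTER_OR_SPACE.sub('', text)


print(repr(strip_non_letters('happy t00 go 129.129')))  # 'happy t go '
```

The spaces on either side of a removed run are kept, so the result can contain consecutive or trailing spaces.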

answered Feb 04 '14 at 22:15 by Joel Cornett

Alex L · Answer 3 · 2018-03-26T07:52:38.603

Slight variation on inspectorG4dget's method - import from string & generator comprehension:

from string import ascii_letters

allowed = set(ascii_letters + ' ')
myStr = 'happy t00 go 129.129'
answer = ''.join(l for l in myStr if l in allowed)
answer
# >>> 'happy t go '

Performance comparison:

(I made myStr a bit longer and pre-compiled the regex to make things a bit more interesting)

# Run in IPython/Jupyter: %timeit is an IPython magic, not plain Python
import re
from string import ascii_letters
myStr = 'happy t00 go 129.129'*20
allowed = set(ascii_letters + ' ')

# Generator
%timeit answer = ''.join(l for l in myStr if l in allowed)

# filter/__contains__
%timeit answer = ''.join(filter(allowed.__contains__, myStr))

# Regex
pat = re.compile(r'[^a-zA-Z ]+')
%timeit answer = re.sub(pat, '', myStr)

Generator:           53 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
filter/__contains__: 43.3 µs ± 7.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Regex:               26 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
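
Outside IPython, a roughly equivalent comparison can be run with the standard-library `timeit` module. The sketch below is an assumption about how one might set that up, not the code that produced the numbers above; the variable names are illustrative, and wrapping each candidate in a lambda adds a small constant call overhead to all three.

```python
import re
import timeit
from string import ascii_letters

my_str = 'happy t00 go 129.129' * 20
allowed = set(ascii_letters + ' ')
pat = re.compile(r'[^a-zA-Z ]+')

# The three approaches from this thread, as zero-argument callables.
candidates = {
    'generator': lambda: ''.join(c for c in my_str if c in allowed),
    'filter': lambda: ''.join(filter(allowed.__contains__, my_str)),
    'regex': lambda: pat.sub('', my_str),
}

# Timings are only comparable if every approach returns the same string.
results = {name: fn() for name, fn in candidates.items()}
assert len(set(results.values())) == 1

for name, fn in candidates.items():
    # Best of 5 repeats, each repeat being 1000 calls; report microseconds per call.
    best = min(timeit.repeat(fn, number=1000, repeat=5))
    print(f'{name:>9}: {best / 1000 * 1e6:.1f} µs per call')
```

Absolute numbers depend on the machine and Python version, so only the relative ordering is worth reading from such a run.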

I found this to be the best answer. It is more readable and it shows how we can use the [string constants](https://docs.python.org/3/library/string.html#module-string) instead of typing them manually which could easily introduce an error. — Bernard, Feb 23 '19 at 03:36
