19

I have this example string: 'happy t00 go 129.129', and I want to keep only the letters and spaces. The only reasonably efficient approach I have come up with so far is:

print(re.sub(r"\d", "", 'happy t00 go 129.129'.replace('.', '')))

but it only works for my example string. How can I remove all characters other than letters and spaces?

Gronk
  • None of the answers handles anything beyond the 26 ASCII letters, e.g. ß, Ä, Ö, Ü, Ą, Ż, etc. Perhaps the question should mention only ASCII letters? – Katarzyna Sep 21 '20 at 15:52

3 Answers

29
whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
myStr = "happy t00 go 129.129$%^&*("
answer = ''.join(filter(whitelist.__contains__, myStr))

Output:

>>> answer
'happy t go '
inspectorG4dget
  • After testing I found this to be 0.0029 usec faster than Joel's answer when run as a `python -m timeit -n 100 -s` loop for each code in Command Prompt. – Gronk Feb 04 '14 at 23:39
  • @Gronk: `Timer('"".join(filter(whitelist.__contains__, myStr))', '''whitelist = set('abcdefghijklmnopqrstuvwxy ABCDEFGHIJKLMNOPQRSTUVWXYZ'); myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)` gives `0.02490997314453125`, while `Timer('re.sub(r"[^a-zA-Z ]+", "", myStr)', '''import re; myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)` gives `0.011039972305297852`. My point is that 0.0029 usec is definitely within the normal variation for a sample size of 100. – Joel Cornett Feb 05 '14 at 01:09
  • This also filters out accented alphabet characters, which might be a problem. – bp. Sep 29 '17 at 04:10
  • The lowercase character "z" is missing – Daniel Marschall May 19 '18 at 16:32
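
On the accented-letter concern raised in the comments above: a minimal sketch, assuming "letter" should mean anything Unicode classifies as a letter rather than only a-z and A-Z. The function name `keep_letters_and_spaces` is made up for illustration and does not come from any of the answers.

```python
def keep_letters_and_spaces(text):
    """Keep every character that str.isalpha() accepts, plus the plain space."""
    # str.isalpha() is True for ASCII letters and also for letters such as
    # 'ß', 'Ż' or 'ó'; digits, punctuation and symbols are dropped.
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')


print(repr(keep_letters_and_spaces('happy t00 go 129.129')))  # 'happy t go '
print(repr(keep_letters_and_spaces('straße123 Żółw!')))       # 'straße Żółw'
```

Note that this also keeps letters from non-Latin scripts, so it is broader than the ASCII whitelist; whether that is desirable depends on the input.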
18

Use a negated character class, i.e. the complement of the set of allowed characters:

import re
re.sub(r'[^a-zA-Z ]+', '', 'happy t00 go 129.129')
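
To make the pattern a little more concrete, here is a small sketch that wraps the same regex in a reusable function. The names `NOT_LETTER_OR_SPACE` and `strip_non_letters` are illustrative only. The `^` inside the brackets negates the class, and the `+` lets one substitution consume a whole run of unwanted characters.

```python
import re

# Compiled once: matches any run of characters that are NOT
# an ASCII letter (a-z, A-Z) or a literal space.
NOT_LETTER_OR_SPACE = re.compile(r'[^a-zA-Z ]+')


def strip_non_letters(text):
    """Remove everything except ASCII letters and spaces."""
    return NOT_LETTER_OR_SPACE.sub('', text)


print(repr(strip_non_letters('happy t00 go 129.129')))  # 'happy t go '
print(repr(strip_non_letters('a.b,c  d\te')))           # 'abc  de'
```

Only the literal space is allowed here, so tabs and newlines are removed as well; adding `\s` to the class would be one way to keep them, if that is what is wanted.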
Joel Cornett
9

A slight variation on inspectorG4dget's method: import the letters from `string` and use a generator expression:

from string import ascii_letters

allowed = set(ascii_letters + ' ')
myStr = 'happy t00 go 129.129'
answer = ''.join(l for l in myStr if l in allowed)
answer
# >>> 'happy t go '

Performance comparison:

(I made myStr a bit longer and pre-compiled the regex to make things a bit more interesting)

import re
from string import ascii_letters
myStr = 'happy t00 go 129.129'*20
allowed = set(ascii_letters + ' ')

# Generator
%timeit answer = ''.join(l for l in myStr if l in allowed)

# filter/__contains__
%timeit answer = ''.join(filter(allowed.__contains__, myStr))

# Regex
pat = re.compile(r'[^a-zA-Z ]+')
%timeit answer = re.sub(pat, '', myStr)

Generator: 53 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
filter/__contains__: 43.3 µs ± 7.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Regex: 26 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
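
The `%timeit` lines above only run inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in plain Python with the standard `timeit` module. The variable names and the `number`/`repeat` values below are arbitrary choices, and the lambda wrappers add a small call overhead, so absolute numbers will differ from the ones shown above.

```python
import re
import timeit
from string import ascii_letters

my_str = 'happy t00 go 129.129' * 20
allowed = set(ascii_letters + ' ')
pat = re.compile(r'[^a-zA-Z ]+')

# The three approaches from the answers, wrapped as zero-argument callables.
candidates = {
    'generator': lambda: ''.join(ch for ch in my_str if ch in allowed),
    'filter': lambda: ''.join(filter(allowed.__contains__, my_str)),
    'regex': lambda: pat.sub('', my_str),
}

# All approaches should agree before their speed is compared.
results = {name: fn() for name, fn in candidates.items()}
assert len(set(results.values())) == 1

for name, fn in candidates.items():
    # repeat() returns one total time per run; the minimum is the least noisy.
    best = min(timeit.repeat(fn, number=200, repeat=3))
    print(f'{name:>9}: {best / 200 * 1e6:.1f} µs per call')
```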

Alex L
  • I found this to be the best answer. It is more readable and it shows how we can use the [string constants](https://docs.python.org/3/library/string.html#module-string) instead of typing them manually which could easily introduce an error. – Bernard Feb 23 '19 at 03:36