
Although there are similar questions, I can't seem to find a working solution for my case:

I'm encountering some annoying hex chars in strings, e.g.

'\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'

What I need is to remove these hex \xHH characters, and them alone, in order to get the following result:

'http://www.google.com blah blah#%#@$^blah'

Decoding doesn't help, since it only turns the bytes into Unicode escapes:

s.decode('utf8') # u'\u201chttp://www.google.com\u201d blah blah#%#@$^blah'

How can I achieve that?

Kludge

4 Answers


Just remove all non-ASCII characters:

>>> s.decode('utf8').encode('ascii', errors='ignore')
'http://www.google.com blah blah#%#@$^blah'

Another possible solution:

>>> import string
>>> s = '\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'http://www.google.com blah blah#%#@$^blah'

Or use a regular expression:

>>> import re
>>> re.sub(r'[^\x00-\x7f]',r'', s) 
'http://www.google.com blah blah#%#@$^blah'

Pick your favorite one.
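
For readers on Python 3, here is a minimal sketch of the filter and regex variants. It assumes the input has already been decoded to a str; the variable names are illustrative only. The main difference is that filter() returns an iterator in Python 3, so the kept characters have to be joined back together.

```python
import re
import string

# Assumption: Python 3, with the text already decoded to str.
text = "\u201chttp://www.google.com\u201d blah blah#%#@$^blah"

printable = set(string.printable)

# filter() yields characters lazily in Python 3, so join them into a str.
filtered = "".join(filter(lambda ch: ch in printable, text))

# The regex works unchanged on str: drop everything outside U+0000..U+007F.
regexed = re.sub(r"[^\x00-\x7f]", "", text)

print(filtered)
print(regexed)
```

The two are not quite equivalent: string.printable also rejects most ASCII control characters, while the regex keeps every code point up to 0x7f.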

Magnun Leno

These are not "hex characters" but escaped representations of the Unicode characters LEFT DOUBLE QUOTATION MARK ('“') and RIGHT DOUBLE QUOTATION MARK ('”'): their UTF-8 encoded bytes in the first case, their Unicode code points in the second.

>>> s = "\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah"
>>> print s
“http://www.google.com” blah blah#%#@$^blah
>>> s.decode("utf-8")
u'\u201chttp://www.google.com\u201d blah blah#%#@$^blah'
>>> print s.decode("utf-8")
“http://www.google.com” blah blah#%#@$^blah

As for how to remove them: inside the byte string they are just ordinary byte sequences, so a simple str.replace() will do:

>>> s.replace("\xe2\x80\x9c", "").replace("\xe2\x80\x9d", "")
'http://www.google.com blah blah#%#@$^blah'
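
If the string has already been decoded, the same targeted removal could be sketched with str.translate, which deletes only the listed code points and leaves any other non-ASCII text alone. This assumes Python 3; the CURLY_QUOTES table is a hypothetical name introduced for illustration.

```python
# Assumption: Python 3; raw holds the UTF-8 bytes from the question.
raw = b"\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah"
text = raw.decode("utf-8")

# Mapping a code point to None tells translate() to delete that character.
CURLY_QUOTES = {ord("\u201c"): None, ord("\u201d"): None}

cleaned = text.translate(CURLY_QUOTES)
print(cleaned)
```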

If you want to get rid of all non-ASCII characters at once, decode to unicode and then encode to ASCII with the "ignore" error handler:

>>> s.decode("utf-8").encode("ascii", "ignore")
'http://www.google.com blah blah#%#@$^blah'
bruno desthuilliers
    AttributeError: 'str' object has no attribute 'decode' – Pyd Nov 16 '18 at 17:17
    @pyd: the question is tagged python 2.7, and `str` does have a `decode` method in Python 2.7. It disappeared in Python 3, where strings are Unicode and a `decode` method would make no sense, but it still exists on Python 3 byte strings (type `bytes`). – bruno desthuilliers Nov 30 '18 at 14:33
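
For Python 3, the decode-then-encode approach from this answer could look roughly like the sketch below. It assumes the data arrives as UTF-8 encoded bytes; strip_non_ascii is a hypothetical helper name, not something from the answer.

```python
def strip_non_ascii(data: bytes) -> str:
    """Decode UTF-8 bytes and drop every character outside ASCII."""
    text = data.decode("utf-8")
    # errors="ignore" silently discards characters ASCII cannot represent.
    return text.encode("ascii", errors="ignore").decode("ascii")


# Assumption: the input is the UTF-8 byte string from the question.
raw = b"\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah"
print(strip_non_ascii(raw))
```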

You could check each character against a set of valid characters, and rather than typing them all out, use the constants in the string module. The ones that may be useful to you are string.ascii_letters (which combines string.ascii_lowercase and string.ascii_uppercase), string.digits, string.printable and string.punctuation.

I'd try string.printable first, but if it lets a few too many characters through, you could use a mix of the others.

Here's an example of how I'd do it:

import string
valid_characters = string.printable
start_string = '\xe2\x80\x9chttp://www.google.com\xe2\x80\x9d blah blah#%#@$^blah'
end_string = ''.join(i for i in start_string if i in valid_characters)
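
To show how mixing the string constants changes the result, here is a small sketch that assumes Python 3 and text that is already decoded. keep_allowed is a hypothetical helper, and the stricter whitelist is only one possible choice.

```python
import string


def keep_allowed(text, allowed):
    """Return text with every character not found in `allowed` removed."""
    allowed_set = set(allowed)  # set membership is faster than scanning a str
    return "".join(ch for ch in text if ch in allowed_set)


# Assumption: the question's string, already decoded to str.
start_string = "\u201chttp://www.google.com\u201d blah blah#%#@$^blah"

# string.printable keeps ASCII punctuation, so the URL survives intact.
print(keep_allowed(start_string, string.printable))

# A stricter mix (letters, digits and spaces only) also drops punctuation.
strict = string.ascii_letters + string.digits + " "
print(keep_allowed(start_string, strict))
```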
Peter

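In Python 3, where s is already a str, you can encode to ASCII while ignoring errors and then decode the result. (On the Python 2 byte string from the question, s.encode() would first attempt an implicit ASCII decode and raise UnicodeDecodeError.)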

s.encode('ascii', errors='ignore').decode("utf-8")
Graham