Decoding Python Unicode strings that contain double backslashes

Question

My strings look like this: \\xec\\x88\\x98. If I print them they look like this: \xec\x88\x98, and when I decode them they look like this: \xec\x88\x98.

If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want 수.

If I call x.decode('unicode-escape'), it removes the double backslashes, but when I decode the value that call returns, I get ì.

How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?

[This](http://stackoverflow.com/questions/29805425/python-2-7-how-to-convert-unicode-escapes-in-a-string-into-actual-utf-8-charact) seems like it might be useful. — TigerhawkT3, Dec 29 '16 at 06:30
You should _always_ tag Unicode questions with the Python version you're using because Unicode handling in Python 2 is quite different to how it works in Python 3. — PM 2Ring, Dec 29 '16 at 06:37
Is this python 2 or 3? Showing escaped strings can be confusing... can you show us the `repr` of the string (what you'd type into python to get the string)? A good way to do that is `print(repr(x))` and then post the quotes and everything. — tdelaney, Dec 29 '16 at 06:38
@tdelaney also, I wrote a web spider and the text is pulled from Korean news sites — jwnz, Dec 29 '16 at 06:45
It sounds like you did something wrong to get these strings in the first place. Maybe an extra `str` call somewhere or something like that. — user2357112, Dec 29 '16 at 07:48

PM 2Ring · Accepted Answer · 2016-12-29T07:59:18.037

In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.

Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.

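import unicodedata as ud

# The source string holds literal backslashes followed by 'xec', 'x88', 'x98'
src = '\\xec\\x88\\x98'
print repr(src)

# 'string-escape' turns each literal \xNN escape into the byte it names
s = src.decode('string-escape')
print repr(s)

# Those three bytes are UTF-8, so decode them to get the Unicode character
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')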

output

'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218

However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
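For anyone on Python 3, where the 'string-escape' codec no longer exists, here is a rough sketch of the same two-step idea. It also shows why 'unicode-escape' on its own gave ì: that codec turns each \xNN into the code point U+00NN rather than a byte, so the result has to be pushed back to bytes through Latin-1 before the UTF-8 decode. The helper name `unescape_utf8` is made up for illustration, and the sketch assumes the input contains only characters up to U+00FF (anything higher would make the Latin-1 step fail).

```python
import unicodedata


def unescape_utf8(text):
    """Hypothetical helper: turn literal \\xNN escapes into the UTF-8 text they spell.

    Assumes every character in `text` fits in Latin-1 (code points 0-255).
    """
    # Step 1: interpret the escapes. Each \xNN becomes code point U+00NN.
    code_points = text.encode('latin-1').decode('unicode_escape')
    # Step 2: Latin-1 maps U+00NN back to the single byte 0xNN.
    raw = code_points.encode('latin-1')
    # Step 3: the bytes are UTF-8, so decode them as such.
    return raw.decode('utf-8')


src = '\\xec\\x88\\x98'

# Stopping after 'unicode_escape' leaves three separate code points,
# the first of which is U+00EC (the letter i with a grave accent).
half_done = src.encode('latin-1').decode('unicode_escape')
print(ascii(half_done))

result = unescape_utf8(src)
print(unicodedata.name(result))
print(ascii(result))
```

As with the 'string-escape' version, this only repairs the symptom; it does not replace fixing the spider.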

`'string-escape'` seems to have solved my problems. Also, thanks for the tip! — jwnz, Dec 29 '16 at 08:46
