
My strings look like this \\xec\\x88\\x98, but if I print them they look like this \xec\x88\x98, and when I decode them they look like this \xec\x88\x98

If I type the string in manually as \xec\x88\x98 and then decode it, I get the value I want.

If I call `x.decode('unicode-escape')`, it removes the doubled backslashes, but when I decode the value it returns, I get ì.

How would I go about decoding the original \\xec\\x88\\x98 so that I get the correct output?

TigerhawkT3
jwnz
  • [This](http://stackoverflow.com/questions/29805425/python-2-7-how-to-convert-unicode-escapes-in-a-string-into-actual-utf-8-charact) seems like it might be useful. – TigerhawkT3 Dec 29 '16 at 06:30
  • You should _always_ tag Unicode questions with the Python version you're using because Unicode handling in Python 2 is quite different to how it works in Python 3. – PM 2Ring Dec 29 '16 at 06:37
  • Is this python 2 or 3? Showing escaped strings can be confusing... can you show us the `repr` of the string (what you'd type into python to get the string)? A good way to do that is `print(repr(x))` and then post the quotes and everything. – tdelaney Dec 29 '16 at 06:38
  • @PM2Ring I've updated the tags, thx – jwnz Dec 29 '16 at 06:42
  • @tdelaney `'\\xec\\x88\\x98'` is what I get – jwnz Dec 29 '16 at 06:44
  • @tdelaney also, I wrote a web spider and the text is pulled from Korean news sites – jwnz Dec 29 '16 at 06:45
  • It sounds like you did something wrong to get these strings in the first place. Maybe an extra `str` call somewhere or something like that. – user2357112 Dec 29 '16 at 07:48

1 Answer


In Python 2 you can use the 'string-escape' codec to convert '\\xec\\x88\\x98' to '\xec\x88\x98', which is the UTF-8 encoding of u'\uc218'.

Here's a short demo. Unfortunately, my terminal's font doesn't have that character, so I can't print it. Instead, I'll print its name and its representation, and I'll also convert it to a Unicode-escape sequence.

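```python
import unicodedata as ud

# The doubly escaped text: literal backslashes, not raw bytes.
src = '\\xec\\x88\\x98'
print repr(src)

# 'string-escape' (Python 2 only) turns the escape sequences into the bytes they name.
s = src.decode('string-escape')
print repr(s)

# Those bytes are UTF-8, so decode them to get the Unicode character.
u = s.decode('utf8')
print ud.name(u)
print repr(u), u.encode('unicode-escape')
```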

Output:

```
'\\xec\\x88\\x98'
'\xec\x88\x98'
HANGUL SYLLABLE SU
u'\uc218' \uc218
```
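The 'string-escape' codec exists only in Python 2. As a rough sketch of why 'unicode-escape' on its own produced ì, and of a round trip that should behave the same on Python 2 and Python 3, something like the following may help. The variable names `code_points` and `text` are mine, and it assumes the escaped text contains only ASCII characters, as in the question.

```python
import unicodedata

# The doubly escaped text from the question: literal backslashes, all ASCII.
src = '\\xec\\x88\\x98'

# 'unicode_escape' reads each \xNN as a code point (U+00EC, U+0088, U+0098),
# not as a byte. U+00EC is "i with grave", which is the wrong character
# the question reports seeing.
code_points = src.encode('latin-1').decode('unicode_escape')

# Latin-1 maps code points 0-255 to the identical byte values, so encoding
# with it recovers the raw UTF-8 bytes, which can then be decoded normally.
text = code_points.encode('latin-1').decode('utf-8')

print(unicodedata.name(text))  # HANGUL SYLLABLE SU
```

The Latin-1 step is only a way to turn code points back into bytes; it does not mean the data was ever Latin-1.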

However, this is a "band-aid" solution. You should try to fix this problem upstream (in your Web spider) so that you receive the data as plain UTF-8 instead of that string-escaped UTF-8 that you're currently getting.
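I can't see your spider, so the following is only a guess at how the escaped form might arise, along the lines of the "extra `str` call" suggested in the comments. Here `raw` is a stand-in for the bytes your spider downloads, and `escaped` and `text` are names I made up for illustration.

```python
# Stand-in for the UTF-8 bytes fetched from a page (hypothetical data).
raw = u'\uc218'.encode('utf-8')

# One possible cause: keeping the repr() of the byte string (or, in
# Python 3, calling str() on it) instead of decoding it. The result is
# ordinary text that contains literal backslash escapes.
escaped = repr(raw)
print(escaped)

# Decoding the bytes with the page's actual encoding avoids the problem.
# UTF-8 is assumed here; a real spider should check the response encoding.
text = raw.decode('utf-8')
print(repr(text))
```

If your code does something similar anywhere between the download and the point where you store the text, fixing that spot should make the decoding workaround unnecessary.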

PM 2Ring