
I use Python 2.7 and I'm receiving a string from a server (a byte string, not unicode). Inside that string I find text with Unicode escape sequences, for example:

<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>

How do I convert those \uxxxx sequences to UTF-8? The answers I found either dealt with &# entities or required eval(), which is too slow for my purposes. I need a universal solution for any text containing such sequences.

Edit: <\a> is a typo, but I want to tolerate such typos as well. Only \u sequences should trigger a conversion.

In proper Python syntax, the example text looks like this:

"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

The desired output, in proper Python syntax, is:

"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"
evolution
  • have you tried `str.encode('utf-8')`? That converts a string into unicode. – Matthew Apr 22 '15 at 17:59
  • `<\a>` isn't valid HTML either... – Eric Apr 22 '15 at 18:02
  • what are you trying to do? – Padraic Cunningham Apr 22 '15 at 18:04
  • In what encoding are you receiving the string from the external source? – Paulo Bu Apr 22 '15 at 18:04
  • The fact that your string contains `\a` and not `\\a` strongly suggests this is not possible - how can you distinguish _"I want the character entity described by `\u0441`"_ from _"I want the sequence of 6 characters `\u0441`"_ – Eric Apr 22 '15 at 18:06
  • I think `<\a>` is a typo – Paulo Bu Apr 22 '15 at 18:07
  • Is this the string you want... `'\xc2\xb2'` – Shashank Apr 22 '15 at 18:12
  • yes `<\a>` is a typo, but I want to be tolerant to such typos. And yes I want the string Shashank mentions. @Eric: I can't distinguish those cases but I want it always converted by default whenever there is a substring like that. `\\u0441` (I mean `\\\u0441` in proper python syntax) should be converted to `\\xd1` (by which I mean `\\\xd1` in proper python syntax) – evolution Apr 22 '15 at 19:28
  • Now I'm getting confused with those slashes, and I meant \xd1\x81 of course... I updated the question – evolution Apr 22 '15 at 19:46

2 Answers


Try

>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'

You can then encode the result to UTF-8 as usual.
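As a minimal sketch of the full round trip (the variable names `text` and `utf8_bytes` are illustrative, not from the answer above), the decoded text can be re-encoded as follows. It assumes the input is plain ASCII apart from the escapes. The `b` prefix changes nothing on Python 2.7 and keeps the snippet runnable on Python 3 as well:

```python
# Byte string as received from the server, with literal backslash-u escapes.
s = b"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

# raw_unicode_escape interprets only the backslash-u style escapes;
# other backslash sequences such as the stray \a are left as typed.
text = s.decode("raw_unicode_escape")

# Encode the resulting text object to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
```

The result should match the desired output given in the question, including the untouched `<\a>`.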

Ella Sharakanski
  • Seems more like what I was looking for. For some reason it still doesn't convert the \\u0441 (as it didn't for you) – evolution May 02 '15 at 19:52

Python does contain some special string codecs for cases like this.

In this case, if there are no bytes outside the 32-127 range, you can safely decode your byte string using the "unicode_escape" codec to get a proper Unicode text object in Python, on which your program should perform all textual operations. Whenever you output that text again, encode it to UTF-8 as usual:

rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
output_text = text.encode("utf-8")
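One caveat worth checking against the sample input: "unicode_escape" interprets every Python string escape, not only \uXXXX, so the stray `<\a>` turns into a BEL control character. A small sketch of the difference from "raw_unicode_escape" (the `sample` value is made up for illustration):

```python
# A short byte string with one backslash-u escape and one stray \a.
sample = b"\\u00b2<\\a>"

# unicode_escape also converts \a into the BEL control character (0x07).
all_escapes = sample.decode("unicode_escape")

# raw_unicode_escape converts only the backslash-u part and keeps \a as typed.
only_u = sample.decode("raw_unicode_escape")
```

If only \u sequences should be touched, as the question asks, "raw_unicode_escape" is probably the closer fit.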

If there are other bytes outside the 32-127 range, the unicode_escape codec assumes them to be in the latin1 encoding. So if your response mixes UTF-8 and these \uXXXX sequences, you have to:

  1. decode the original string using utf-8
  2. encode back to latin1
  3. decode using "unicode_escape"
  4. work on the text
  5. encode back to utf-8
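The five steps above could be sketched roughly as follows. The helper name `unescape_mixed` and the sample input are hypothetical, and the sketch only works when every literally encoded character fits in latin1; otherwise step 2 raises an error:

```python
def unescape_mixed(raw_bytes):
    # Hypothetical helper name; it only chains the steps listed above.
    text = raw_bytes.decode("utf-8")               # 1. decode the original UTF-8
    # 2. Re-encode as latin1. This raises UnicodeEncodeError if the input
    #    holds literal characters outside latin1 (for example Cyrillic).
    latin1_bytes = text.encode("latin1")
    text = latin1_bytes.decode("unicode_escape")   # 3. resolve the escapes
    # 4. Text operations on the unicode object would go here.
    return text.encode("utf-8")                    # 5. encode back to UTF-8


# Sample input: a UTF-8 encoded e-acute mixed with two backslash-u escapes.
result = unescape_mixed(b"caf\xc3\xa9 \\u0441 \\u00b2")
```

Here the literal é survives the round trip while both escapes become their UTF-8 byte sequences.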
jsbueno