Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters

Question

I'm using Python 2.7 and receiving a byte string (not a unicode object!) from a server. That string contains text with Unicode escape sequences, for example:

<a href = "http://www.mypage.com/\u0441andmoretext">\u00b2<\a>

How do I convert those \uxxxx sequences to UTF-8? The answers I found either dealt with &# entities or required eval(), which is too slow for my purposes. I need a general solution for any text containing such sequences.

Edit: <\a> is a typo, but I want to tolerate such typos as well. Only \u sequences should be converted.

In proper Python syntax, the example text is:

"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

The desired output, in proper Python syntax, is:

"<a href = \"http://www.mypage.com/\xd1\x81andmoretext\">\xc2\xb2<\\a>"

have you tried `str.encode('utf-8')`? That converts a string into unicode. — Matthew, Apr 22 '15 at 17:59
In what encoding are you receiving the string from the external source? — Paulo Bu, Apr 22 '15 at 18:04
The fact that your string contains `\a` and not `\\a` strongly suggests this is not possible - how can you distinguish _"I want the character entity described by `\u0441`"_ from _"I want the sequence of 6 characters `\u0441`"_ — Eric, Apr 22 '15 at 18:06
yes `<\a>` is a typo, but I want to be tolerant to such typos. And yes I want the string Shashank mentions. @Eric: I can't distinguish those cases but I want it always converted by default whenever there is a substring like that. `\\u0441` (I mean `\\\u0441` in proper python syntax) should be converted to `\\xd1` (by which I mean `\\\xd1` in proper python syntax) — evolution, Apr 22 '15 at 19:28
Now I'm getting confused with those slashes. and I meant \xd1\x81 of course... I updated the question — evolution, Apr 22 '15 at 19:46

score 6 · Answer 1 · answered Apr 23 '15 at 20:20


Try

>>> s = "<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
>>> s.decode("raw_unicode_escape")
u'<a href = "http://www.mypage.com/\u0441andmoretext">\xb2<\\a>'

And then you can encode to utf8 as usual.
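
A minimal sketch of the full round trip, assuming the input is a byte string made of ASCII plus \uXXXX escapes. The variable names are illustrative and not from the question; the b prefix lets the same lines run on Python 2.7 and Python 3.

```python
# Byte string as received from the server (sample taken from the question).
s = b"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"

# raw_unicode_escape only interprets \uXXXX and \UXXXXXXXX escapes;
# other backslash sequences such as \a are left untouched.
text = s.decode("raw_unicode_escape")

# Encode the resulting unicode object to UTF-8 bytes.
utf8_bytes = text.encode("utf-8")
```

The \u0441 shown in the interactive output above is only how Python 2 displays the already decoded character U+0441 in a repr. Note that this codec treats bytes outside the ASCII range as Latin-1, so input that already contains UTF-8 bytes would need different handling.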


Ella Sharakanski



Seems more like what I was looking for. For some reason it still doesn't convert the \\u0441 (as it didn't for you) – evolution May 02 '15 at 19:52

score 2 · Answer 2 · answered Apr 22 '15 at 18:14

Python does contain some special string codecs for cases like this.

In this case, if the string contains no bytes outside the 32-127 range, you can safely decode your byte string with the "unicode_escape" codec to get a proper Unicode text object, which is what your program should use for all text operations. Whenever you output that text again, encode it to UTF-8 as usual:

rawtext = r"""<a href="http://www.mypage.com/\u0441andmoretext">\u00b2<\a>"""
text = rawtext.decode("unicode_escape")
# Text operations go here
...
output_text = text.encode("utf-8")

If there are other bytes outside the 32-127 range, the unicode_escape codec assumes they are Latin-1 encoded. So if your response mixes UTF-8 and these \uXXXX sequences, you have to:

decode the original string using utf-8
encode back to latin1
decode using "unicode_escape"
work on the text
encode back to utf-8
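
The steps above can be sketched as follows. This is an illustration under stated assumptions: the sample bytes are made up, and the Latin-1 step only works if every non-escaped character in the input fits in Latin-1 (code points up to U+00FF); anything beyond that raises UnicodeEncodeError.

```python
# Hypothetical input: UTF-8 bytes for "café" followed by a literal \u0441 escape.
raw = b"caf\xc3\xa9 \\u0441"

text = raw.decode("utf-8")              # 1. decode the original string using utf-8
latin = text.encode("latin1")           # 2. encode back to latin1
text = latin.decode("unicode_escape")   # 3. decode using "unicode_escape"
# 4. work on the text here
output = text.encode("utf-8")           # 5. encode back to utf-8
```

Like the single-step version, this still interprets every backslash escape the codec knows, not just \uXXXX.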

This converts the `"\\a"` too, and I think the OP wanted it to remain as it is. I get `text = u'\xb2<\x07>'` — Ella Sharakanski, Apr 23 '15 at 17:41
Which is so bad news to the OP - that means the only workable solution will be a regexp -substitution parsing. — jsbueno, Apr 23 '15 at 19:49
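
The regexp substitution mentioned in the comment above could look roughly like this. It is a sketch under assumptions: the names U_ESCAPE, _to_utf8 and unescape_u are made up for illustration, only four-digit \uXXXX sequences are matched, and surrogate pairs written as two consecutive escapes are not combined. Everything else in the byte string, including \a and any existing UTF-8 bytes, is left as it is.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits in a byte string.
U_ESCAPE = re.compile(br"\\u[0-9a-fA-F]{4}")

def _to_utf8(match):
    # Decode just the six matched bytes, then encode that character as UTF-8.
    return match.group(0).decode("unicode_escape").encode("utf-8")

def unescape_u(data):
    """Replace every \\uXXXX sequence in a byte string with its UTF-8 bytes."""
    return U_ESCAPE.sub(_to_utf8, data)

sample = b"<a href = \"http://www.mypage.com/\\u0441andmoretext\">\\u00b2<\\a>"
result = unescape_u(sample)
```

Because only the matched six bytes ever pass through the codec, other backslash sequences are never interpreted, which is the tolerance the question asks for.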
