python ascii codes to utf

Question

So when I post a name or text in mod_python in my native language, I get:

&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;

And I also get:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

When I use:

hparser = HTMLParser.HTMLParser()
req.write(hparser.unescape(text))

How can I decode it?

Katriel · Accepted Answer · 2012-04-16T11:16:04.307

It's hard to explain UnicodeErrors if you don't understand the underlying mechanism. You should really read either or both of

Pragmatic Unicode (Ned Batchelder)
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (Joel Spolsky)

In a (very small) nutshell, a Unicode code point is an abstract "thingy" representing one character¹. Programmers like to work with these, because we like to think of strings as coming one character at a time. Unfortunately, it was decreed a long time ago that a character must fit in one byte of memory, so there can be at most 256 different characters. Which is fine for plain English, but doesn't work for anything else. There's a global list of code points -- well over a hundred thousand of them -- which are meant to hold every possible character, but clearly they don't fit in a byte.

The solution: there is a difference between the ordered list of code points that make a string, and its encoding as a sequence of bytes. You have to be clear whenever you work with a string which of these forms it should be in.

To convert between the forms you can .encode() a list of code points (a Unicode string) as a list of bytes, and .decode() bytes into a list of code points. To do so, you need to know how to map code points into bytes and vice versa, which is the encoding. If you don't specify one, Python 2.x will guess that you meant ASCII. If that guess is wrong, you will get a UnicodeError.
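As a minimal sketch of that round trip, assuming Python 3 (the session further down uses Python 2.7, where the same two methods exist on unicode and str objects), encoding and decoding with the same named encoding gets you back where you started:

```python
# A Unicode string: a sequence of code points (the word from the question).
word = "\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430"

# encode(): code points -> bytes, using an explicitly named encoding.
data = word.encode("utf-8")

# decode(): bytes -> code points, using the same encoding.
round_trip = data.decode("utf-8")

print(len(word))           # how many code points
print(len(data))           # how many bytes (not the same thing)
print(round_trip == word)  # the round trip is lossless
```

Note that the two lengths differ: the number of code points and the number of bytes are separate quantities, which is the whole point of keeping the two forms apart.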

Note that Python 3.x is much better at handling Unicode strings, because the distinction between bytes and code points is much more clear cut.

¹Sort of.

EDIT: I guess I should point out how this helps. But you really should read the above links! Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem.

Anyway, if you step through what you're doing in the shell you'll see

>>> from HTMLParser import HTMLParser
>>> text = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
>>> hparser = HTMLParser()
>>> text = hparser.unescape(text)
>>> text
u'\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430'

I'm using Python 2.7 here, so that's a Unicode string i.e. a sequence of Unicode code points. We can encode them into a regular string (i.e. a list of bytes) like

>>> text.encode("utf-8")
'\xd0\xbc\xd0\xb0\xd0\xba\xd0\xb5\xd0\xb4\xd0\xbe\xd0\xbd\xd0\xb8\xd1\x98\xd0\xb0'

But we could also pick a different encoding!

>>> text.encode("utf-16")
'\xff\xfe<\x040\x04:\x045\x044\x04>\x04=\x048\x04X\x040\x04'

You'll need to decide what encoding you want to use.
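Why the choice matters can be sketched as follows, again assuming Python 3. The "utf-16-le" and "latin-1" codecs are picked only for illustration: the same code points become different bytes under different encodings, and decoding bytes with the wrong encoding silently produces the wrong text.

```python
word = "\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430"

utf8_bytes = word.encode("utf-8")
# "utf-16-le" fixes the byte order and omits the byte-order mark.
utf16_bytes = word.encode("utf-16-le")

# Same text, different byte sequences.
print(utf8_bytes == utf16_bytes)

# Decoding with the wrong encoding raises no error here,
# it just yields different (garbled) text.
wrong = utf8_bytes.decode("latin-1")
print(wrong == word)

# Decoding with the encoding that was actually used recovers the text.
right = utf8_bytes.decode("utf-8")
print(right == word)
```

So whoever reads your bytes has to be told, or has to agree in advance, which encoding you used.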

What went wrong when you did it? Well, not every encoding understands every code point. In particular, the "ascii" encoding only understands the first 128! So if you try

>>> text.encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)

you just get an error, because you can't encode those code points in ASCII.
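A small sketch of the same failure in Python 3 (an assumption; the traceback above is from Python 2.7). The exception object reports which positions could not be encoded, and the standard "xmlcharrefreplace" error handler shows where the &#NNNN; form in the question comes from: it is how non-ASCII code points get squeezed into ASCII-only bytes.

```python
word = "\u043c\u0430\u043a\u0435\u0434\u043e\u043d\u0438\u0458\u0430"

# Strict ASCII encoding fails: every code point here is above 127.
try:
    word.encode("ascii")
    err = None
except UnicodeEncodeError as exc:
    err = exc
    print(exc.start, exc.end)  # range of offending positions (end is exclusive)

# Asking the codec to substitute numeric character references instead
# of failing yields the entity form seen in the question.
escaped = word.encode("ascii", errors="xmlcharrefreplace")
print(escaped)
```

This is only meant to illustrate what ASCII can and cannot hold; it is not a suggestion to use the error handler as the fix.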

When you do req.write, you are trying to write a list of code points down the wire. But HTTP doesn't understand code points: it just carries bytes. Python 2 will try to be helpful by automatically ASCII-encoding your Unicode strings, which is fine if they really are ASCII but not if they aren't.

So you need to do req.write(hparser.unescape(text).encode("some-encoding")).
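Put together, the unescape-then-encode step might look like the sketch below. It assumes Python 3, where html.unescape replaces HTMLParser.unescape, and it uses io.BytesIO as a stand-in for mod_python's req object. The render helper and the choice of UTF-8 are illustrative, not part of the original code.

```python
import html
import io


def render(text, encoding="utf-8"):
    """Turn HTML character references into code points, then into bytes."""
    decoded = html.unescape(text)    # str: a sequence of code points
    return decoded.encode(encoding)  # bytes: safe to write to a byte sink


# Hypothetical stand-in for req: any object with a bytes-accepting write().
sink = io.BytesIO()

posted = "&#1084;&#1072;&#1082;&#1077;&#1076;&#1086;&#1085;&#1080;&#1112;&#1072;"
sink.write(render(posted))

print(sink.getvalue())
```

Whatever encoding you pick here should match the charset the page declares to the browser, otherwise the bytes will be decoded incorrectly on the other end.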

This is a good explanation of what unicode is, although I'm not sure it really helps the OP go from HTML entities to utf-8 output. — Wooble, Apr 16 '12 at 11:05
@Wooble but the OP's problem isn't the HTML entities! It's the Unicode encoding (as evidenced by the `UnicodeEncodeError`). — Katriel, Apr 16 '12 at 11:09
I need html entity to string, tnx for the explanation but i have read a lot of characters expression. — badc0re, Apr 16 '12 at 11:15
@DameJovanoski No, you haven't! If you understood Unicode support in Python, you would see why you got a Unicode error. The problem isn't in the HTML entity part. — Katriel, Apr 16 '12 at 11:20
my problem was that i forgot to add .encode('utf8') to text variable after the unescape, but tnx a lot — badc0re, Apr 16 '12 at 11:30
+1 for "Just throwing in .encode()s and .decode()s everywhere is a terrible way to code, and one day you'll get bitten by a worse problem." — bgporter, Apr 16 '12 at 11:34
@DameJovanoski: yes, I know! I'll try once more: you need to __understand Unicode__, not just throw `.encode()`s and `.decode()`s on all your strings until it works. — Katriel, Apr 16 '12 at 13:01
