Unicode Cyrillic strings in Python 2.7

Question

I've got a piece of Python 2.7 code that returns a webpage encoded in UTF-8. It essentially does this:

  import urllib2
  from lxml import etree

  arequest = urllib2.urlopen(request.httprequest.host_url[:-1] + record.path)
  response = arequest.read()
  parser = etree.HTMLParser()
  tree = etree.fromstring(response, parser)

I am then pulling out tag information from the tree:

  imgtags = map(lambda x: {'template_tag': False,
                           'tag_type': 'img',
                           'page_id': record.id,
                           'src_value': x.attrib.get("src", ""),
                           'seo_a_title_text': x.attrib.get("title", ""),
                           'seo_text': x.attrib.get("alt", "")},
                tree.findall(".//img"))

The problem is that the result looks like this, with values such as seo_a_title_text containing \xd0-style escapes rather than the \u0428-style ones I need:

[{'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'src_value': '/logo.png', 'seo_text': u'Logo of \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'template_tag': False, 'page_id': 150, 'tag_type': 'img'}]

The Cyrillic string is "Штаты" and I need to convert that \xd0 etc. into \u0428\u0442\u0430\u0442\u044b for a successful database save, otherwise it comes out looking like "Ð¨ÑÐ°ÑÑ" when I read it back again.

How do I get the strings looking like the \u etc. rather than the \x etc.? I must be missing something but I've been thrashing around for hours now on the web and in a console trying to get it to work.

Side note, the top of the file has this comment:

# -*- coding: utf-8 -*-

Not sure if this will affect the answers?

Just as a hint towards the answer, `'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'.decode("utf8") == u'\u0428\u0442\u0430\u0442\u044b'`. – Phillip, Jan 31 '17 at 15:24
Side-note: Just about everything involving non-ASCII text is better in Python 3. If nothing else, the fact that only `str` has `encode` and only `bytes` has `decode` methods, and neither silently converts to the other, makes it much harder to make mistakes. I'd strongly consider switching. – ShadowRanger, Jan 31 '17 at 15:29
`"\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b".decode("utf8")` raises an error: `UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined>` – Mike, Jan 31 '17 at 15:30
I have no choice but to use Python 2.x; the system I'm using is built on 2.x. – Mike, Jan 31 '17 at 15:31
@Mike: And you can't upgrade to/install Python 3? 2.x is not terrible, but one of its biggest weak points is Unicode handling (even some built-in modules, e.g. `csv`, can't handle Unicode data correctly without a ton of manual fixes). – ShadowRanger, Jan 31 '17 at 15:34
Hi, no, the system is a full application with thousands of .py files. I'm stuck with this Python version. – Mike, Jan 31 '17 at 15:35
If you _really_ get `u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'` then something is wrong: this is UTF-8 encoded content, but stored as a unicode object, so you cannot properly decode / re-encode it. Have you tried specifying the encoding to `etree.HTMLParser()`? – bruno desthuilliers, Jan 31 '17 at 15:39
The top of the file is this: -- coding: utf-8 --. Does that affect things? – Mike, Jan 31 '17 at 15:40
The output of the lambda from the tree is that dictionary, which is encoded with the \x format. – Mike, Jan 31 '17 at 15:41
NB: the "-- coding: utf-8 --" (shouldn't that be "# -*- coding: utf-8 -*-", BTW?) at the top of your Python module only specifies the encoding of the module's own source and only affects string literals. – bruno desthuilliers, Jan 31 '17 at 15:43
**Have you tried specifying the correct encoding to etree.HTMLParser()?** – bruno desthuilliers, Jan 31 '17 at 15:45
@brunodesthuilliers Yes, this worked, thanks very much :) The solution was `parser = etree.HTMLParser(encoding='UTF-8')`. Then I can run `imgtags=map(lambda x: {'template_tag':False,'tag_type':'img','page_id':record.id,'src_value':x.attrib.get("src",""),'seo_a_title_text': x.attrib.get("title","").encode('utf-8'),'seo_text': x.attrib.get("alt","").encode('utf-8')}, tree.findall(".//img"))` and the strings come out correctly. Thanks again. – Mike, Jan 31 '17 at 15:53
`arequest.headers.getheader('content-type')` should indicate the encoding instead of hard-coding it. It may be incorrectly reported by the page, however. – Mark Tolonen, Jan 31 '17 at 18:00
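
As a rough sketch of the last comment's suggestion, the charset can be pulled out of the Content-Type header value with the standard library's `email.message.Message`. The helper name `charset_from_content_type` and its `default` fallback are illustrative assumptions, not part of the original code; the header string is whatever `arequest.headers.getheader('content-type')` returns.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type header value.

    `default` is an assumed fallback for headers that name no charset.
    """
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset() or default


print(charset_from_content_type('text/html; charset=UTF-8'))
print(charset_from_content_type('text/html'))
```

The result could then be passed as `etree.HTMLParser(encoding=...)`. As the comment warns, a server may report the wrong charset, so a hard-coded value can still be the safer choice for a page you control.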

Cyrbil · Answer 1 · 2017-01-31T16:28:39.033

The string \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b is the UTF-8 representation of Штаты.

UTF-8 encodes each character using one or more bytes. For instance, Ш (which sits at position 0x0428 in the Unicode table) is encoded in UTF-8 as the two bytes \xd0\xa8.
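
To make that concrete, here is a small sketch that builds the two bytes by hand and compares them with what the codec produces. It assumes the two-byte UTF-8 layout, which covers code points U+0080 to U+07FF and therefore the basic Cyrillic block; the variable names are only for illustration.

```python
# Two-byte UTF-8 layout (code points U+0080..U+07FF): 110xxxxx 10xxxxxx
code_point = 0x0428                      # CYRILLIC CAPITAL LETTER SHA
lead_byte = 0xC0 | (code_point >> 6)     # upper 5 bits go into the lead byte
cont_byte = 0x80 | (code_point & 0x3F)   # lower 6 bits go into the continuation byte
manual = bytes(bytearray([lead_byte, cont_byte]))

# The built-in codec produces the same two bytes.
encoded = u'\u0428'.encode('utf-8')
assert manual == encoded == b'\xd0\xa8'

# The five Cyrillic letters of the word take ten bytes in UTF-8.
word = u'\u0428\u0442\u0430\u0442\u044b'
assert len(word) == 5
assert len(word.encode('utf-8')) == 10
```

This is why the broken value in the question is ten characters long: each of the ten bytes ended up as its own code point.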

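Now the tricky part: you are getting the UTF-8 bytes stored as the code points of a unicode string. You need to convert that string back to bytes before decoding it as UTF-8. One trick is to use ISO 8859-1 (aka Latin-1), because it maps the first 256 Unicode code points to the same byte values. The session below shows Python 3 output; the same calls work in Python 2.7, where only the reprs differ.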

>>> u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'
'Ð¨Ñ\x82Ð°Ñ\x82Ñ\x8b'
>>> u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'.encode('latin1')
b'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'
>>> u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'.encode('latin1').decode('utf8')
'Штаты'
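
The round trip above can be wrapped in a small helper. This is only a sketch: the name `fix_mojibake` and the sample `row` dictionary are made up for illustration, and the function assumes the input is a unicode string that may hold UTF-8 bytes as code points. Text that cannot make the round trip is returned unchanged.

```python
def fix_mojibake(text):
    """Undo 'UTF-8 bytes decoded as Latin-1' damage in a unicode string."""
    try:
        # Latin-1 maps code points 0..255 straight back to the same byte values.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already proper text (or not valid UTF-8 underneath): leave it alone.
        return text


# Values copied from the question's output.
row = {
    'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ',
    'seo_text': u'Logo of \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ',
}
fixed = dict((key, fix_mojibake(value)) for key, value in row.items())

assert fixed['seo_a_title_text'] == u'\u0428\u0442\u0430\u0442\u044b '
assert fixed['seo_text'] == u'Logo of \u0428\u0442\u0430\u0442\u044b '
```

One caveat: genuine Latin-1 text that happens to form valid UTF-8 byte sequences would be altered as well, so this is a repair for known-broken data rather than something to apply blindly.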

Note: As stated by bruno, the parser can be configured with the correct encoding directly, which avoids this kind of dirty encoding juggling:

parser = etree.HTMLParser(encoding='utf8')

A simpler solution is to tell `lxml.etree.HTMLParser` which encoding the HTML content uses. – bruno desthuilliers, Jan 31 '17 at 16:09

Grigor Kolev · Answer 2 · 2017-01-31T14:52:20.010

-1

var = [{'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'src_value': '/logo.png', 'seo_text': u'Logo of \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'template_tag': False, 'page_id': 150, 'tag_type': 'img'}]
print var[0]['seo_a_title_text']

edited Jan 31 '17 at 14:52

answered Jan 31 '17 at 14:50

This doesn't answer the question. I can get the string out easily, but I want to get it out in a different encoding: it needs to come out as \u0428\u0442\u0430\u0442\u044b. – Mike Jan 31 '17 at 15:17
You must make a class with __str__() – Grigor Kolev Jan 31 '17 at 15:19
So in Python, how do I get var[0]['seo_a_title_text'] out as '\u0428\u0442\u0430\u0442\u044b', NOT '\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b '? Or better yet, do that conversion in the lambda code itself. – Mike Jan 31 '17 at 15:21
The second part results in this: UnicodeEncodeError: 'charmap' codec can't encode character u'\u0427' in position 0: character maps to <undefined> – Mike Jan 31 '17 at 15:38
Show the whole script. I get no error in IPython with Python 2.7.3 – Grigor Kolev Jan 31 '17 at 15:52
The solution was to make sure I specified the encoding to etree. Thanks :) – Mike Jan 31 '17 at 15:54
Thank you for your vote. I have no problem with Python (I have 10 years of experience), but I have a big problem with English. I am not a native speaker. – Grigor Kolev Jan 31 '17 at 16:37
