1

I've got a piece of Python 2.7 code that returns a webpage encoded in UTF-8. It essentially does this:

  arequest=urllib2.urlopen(request.httprequest.host_url[:-1]+record.path)
  response=arequest.read()
  parser = etree.HTMLParser()
  tree   = etree.fromstring(response, parser)

I am then pulling out tag information from the tree:

  imgtags = map(lambda x: {
      'template_tag': False,
      'tag_type': 'img',
      'page_id': record.id,
      'src_value': x.attrib.get("src", ""),
      'seo_a_title_text': x.attrib.get("title", ""),
      'seo_text': x.attrib.get("alt", ""),
  }, tree.findall(".//img"))

The problem is that values such as seo_a_title_text come back as unicode strings made of \xd0-style escapes (one per UTF-8 byte) rather than the \u0428-style code points I need:

[{'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'src_value': '/logo.png', 'seo_text': u'Logo of \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'template_tag': False, 'page_id': 150, 'tag_type': 'img'}]

The Cyrillic string is "Штаты" and I need to convert that \xd0 etc. into \u0428\u0442\u0430\u0442\u044b for a successful database save, otherwise it comes out looking like "ШÑаÑÑ" when I read it back again.

How do I get the strings looking like the \u etc. rather than the \x etc.? I must be missing something but I've been thrashing around for hours now on the web and in a console trying to get it to work.

Side note, the top of the file has this comment:

# -*- coding: utf-8 -*-

Not sure if this will affect the answers?

Jongware
Mike
  • 1
    Just as a hint towards the answer, `'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'.decode("utf8") == u'\u0428\u0442\u0430\u0442\u044b'`. – Phillip Jan 31 '17 at 15:24
  • 1
    Side-note: Just about everything involving non-ASCII text is better in Python 3. If nothing else, the fact that only `str` has `encode` and only `bytes` has `decode` methods, and neither silently converts to the other makes it much harder to make mistakes. I'd strongly consider switching. – ShadowRanger Jan 31 '17 at 15:29
  • "\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b".decode("utf8") results in an error: UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-4: character maps to <undefined> – Mike Jan 31 '17 at 15:30
  • I have no choice but to use Python 2.x; the system I'm using is built on 2.x. – Mike Jan 31 '17 at 15:31
  • @Mike: And you can't upgrade to/install Python 3? 2.x is not terrible, but one of its biggest weak points is Unicode handling (even some built-in modules, e.g. `csv`, can't handle Unicode data correctly without a ton of manual fixes). – ShadowRanger Jan 31 '17 at 15:34
  • Hi, no the system is a full application with thousands of py files. Stuck in this python. – Mike Jan 31 '17 at 15:35
  • If you really get u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b' then something is wrong: this is UTF-8 encoded content stored in a unicode object, so you cannot properly decode or re-encode it. Have you tried specifying the encoding to `etree.HTMLParser()`? – bruno desthuilliers Jan 31 '17 at 15:39
  • The top of the file is this:-- coding: utf-8 -- does that affect things? – Mike Jan 31 '17 at 15:40
  • The output of the lambda from the tree is that dictionary. Which is encoded with the \x format. – Mike Jan 31 '17 at 15:41
  • NB: the "-- coding: utf-8 --" (shouldn't that be "# -*- coding: utf-8 -*-", BTW?) at the top of your Python module only specifies the encoding of the module's source itself and only affects literal strings. – bruno desthuilliers Jan 31 '17 at 15:43
  • **Have you tried specifying the correct encoding to etree.HTMLParser() ?** – bruno desthuilliers Jan 31 '17 at 15:45
  • @brunodesthuilliers ->Yes this has worked, thanks very much :) - parser = etree.HTMLParser(encoding='UTF-8') = Solution. Then i can run imgtags=map(lambda x: {'template_tag':False,'tag_type':'img','page_id':record.id,'src_value':x.attrib.get("src",""),'seo_a_title_text': x.attrib.get("title","").encode('utf-8'),'seo_text': x.attrib.get("alt","").encode('utf-8')}, tree.findall(".//img")) and the strings come out correctly. Thanks again. – Mike Jan 31 '17 at 15:53
  • `arequest.headers.getheader('content-type')` should indicate the encoding instead of hard-coding it. It may be incorrectly reported by the page, however. – Mark Tolonen Jan 31 '17 at 18:00

2 Answers

3

The string \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b is the UTF-8 representation of Штаты.

UTF-8 encodes each character using one or more bytes. For instance, Ш (code point U+0428 in Unicode) is encoded in UTF-8 as the two bytes \xd0\xa8.

Now the tricky part: you are getting UTF-8 bytes wrapped in a unicode string, where each byte has been turned into its own code point. You need to convert the string back to bytes before decoding it as UTF-8. One trick is to use ISO 8859-1 (aka Latin-1), because it maps the first 256 Unicode code points to their byte values.
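The two-byte encoding of Ш can be checked directly. This is a minimal sketch that should run unchanged on Python 2.7 and Python 3; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
# Illustrative check: one Cyrillic code point becomes two UTF-8 bytes.
sha = u'\u0428'                     # the letter Ш as a unicode string
encoded = sha.encode('utf-8')       # byte string holding \xd0\xa8
decoded = encoded.decode('utf-8')   # back to the single code point

assert len(sha) == 1                # one character ...
assert len(encoded) == 2            # ... stored as two bytes
assert decoded == sha
```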

>>> u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'
'Ð¨Ñ\x82Ð°Ñ\x82Ñ\x8b'
>>> u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'.encode('latin1')
b'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'
>>> u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b'.encode('latin1').decode('utf8')
'Штаты'
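
If re-parsing the page is not an option, the same Latin-1 round trip can be wrapped in a small helper and applied to every value of a dictionary like the one in the question. This is only a sketch: `repair_mojibake` is a hypothetical name, and the approach is a heuristic that assumes the damage is exactly "UTF-8 bytes decoded as Latin-1". A string that is already correct but happens to survive the round trip would be altered.

```python
def repair_mojibake(value):
    """Undo 'UTF-8 bytes decoded as Latin-1'; return other values unchanged."""
    if not isinstance(value, type(u'')):
        return value    # bools, ints, Python 2 byte strings, ...
    try:
        return value.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value    # not Latin-1 encodable, or not valid UTF-8: leave as-is


# Sample record shaped like the output shown in the question.
tag = {'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ',
       'src_value': '/logo.png', 'page_id': 150, 'template_tag': False}
fixed = dict((key, repair_mojibake(val)) for key, val in tag.items())
```

Strings that already contain real Cyrillic code points raise UnicodeEncodeError on the Latin-1 step and are returned untouched, so the helper can be run over mixed data.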

Note: As bruno stated, the parser can be configured with the correct encoding directly, which avoids this kind of dirty encoding juggling:

parser = etree.HTMLParser(encoding='utf8')
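
Rather than hard-coding 'utf8', the charset could be read from the response's Content-Type header, as Mark Tolonen suggests in the comments on the question. The following is a minimal stdlib-only sketch, assuming the header value has already been obtained (for example via `arequest.headers.getheader('content-type')` on Python 2). `charset_from_content_type` is a hypothetical helper name, and the server may still report the wrong encoding.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Return the charset named in a Content-Type header, or `default`."""
    if not header_value:
        return default
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


charset = charset_from_content_type('text/html; charset=UTF-8')
fallback = charset_from_content_type('text/html')
# The result would then be handed to lxml, e.g.:
#     parser = etree.HTMLParser(encoding=charset)
```

Falling back to UTF-8 is a choice made for this sketch; a page that declares no charset may really use something else.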
Cyrbil
-1
var = [{'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'src_value': '/logo.png', 'seo_text': u'Logo of \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'template_tag': False, 'page_id': 150, 'tag_type': 'img'}]
print var[0]['seo_a_title_text']
  • 1
    This doesnt answer the question? I can get the string out easily but i want to get the string out in a different encoding. it needs to come out as \u0428\u0442\u0430\u0442\u044b. – Mike Jan 31 '17 at 15:17
  • Must make class with __str__() – Grigor Kolev Jan 31 '17 at 15:19
  • so in python how do i get var[0]['seo_a_title_text'] out as '\u0428\u0442\u0430\u0442\u044b' NOT '\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ' - Or better yet, do that conversion in the lambda code itself – Mike Jan 31 '17 at 15:21
  • The second part results in this: UnicodeEncodeError: 'charmap' codec can't encode character u'\u0427' in position 0: character maps to <undefined> – Mike Jan 31 '17 at 15:38
  • Show all script. I have no error in ipython with python 2.7.3 – Grigor Kolev Jan 31 '17 at 15:52
  • Solution was to make sure i've specified to etree the encoding. Thanks :) – Mike Jan 31 '17 at 15:54
  • thank you for your vote. I have not problem with python ( I have 10 years experience), but have big problem with English. I am not native speaker – Grigor Kolev Jan 31 '17 at 16:37