I've got a piece of Python 2.7 code that fetches a web page encoded in UTF-8. It essentially does this:
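import urllib2
from lxml import etree

# Fetch the page from this host and parse the raw response bytes as HTML.
arequest = urllib2.urlopen(request.httprequest.host_url[:-1] + record.path)
response = arequest.read()
parser = etree.HTMLParser()
tree = etree.fromstring(response, parser)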
I am then pulling out tag information from the tree:
imgtags=map(lambda x: {'template_tag':False,'tag_type':'img','page_id':record.id,'src_value':x.attrib.get("src",""),'seo_a_title_text': x.attrib.get("title",""),'seo_text': x.attrib.get("alt","")}, tree.findall(".//img"))
The problem is that the result comes back with items such as seo_a_title_text encoded with \xd0 and not the \u0428 that I need:
[{'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'src_value': '/logo.png', 'seo_text': u'Logo of \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ', 'template_tag': False, 'page_id': 150, 'tag_type': 'img'}]
The Cyrillic string is "Штаты" and I need to convert that \xd0 etc. into \u0428\u0442\u0430\u0442\u044b for a successful database save; otherwise it comes out looking like "ШÑаÑÑ" when I read it back again.
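For reference, here is a minimal sketch that reproduces the symptom in a console without any network or parser involved. It assumes the page bytes really are UTF-8 and that something along the way decodes them as Latin-1; that is a guess at the cause, not something confirmed from the code above. The variable names are made up for illustration.

```python
# The Cyrillic word from the question, written with escapes to stay ASCII-only.
expected = u'\u0428\u0442\u0430\u0442\u044b'

# Bytes as a server would send them in a UTF-8 page.
raw = expected.encode('utf-8')

# Assumption: the bytes get decoded as Latin-1. Every byte then becomes one
# code point, which gives the \xd0\xa8... form seen in the output above.
misdecoded = raw.decode('latin-1')

# Decoding with the matching codec gives the \u0428... form instead.
decoded = raw.decode('utf-8')

print(repr(misdecoded))
print(repr(decoded))
```

The misdecoded value is ten characters long and the correctly decoded one is five, which matches the two-bytes-per-letter pattern in the list above.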
How do I get the strings to look like the \u form rather than the \x form? I must be missing something, but I've been thrashing around for hours now on the web and in a console trying to get it to work.
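One possible after-the-fact workaround, sketched under the same assumption (UTF-8 bytes that were decoded as Latin-1), is to reverse the wrong decode on each text value. `fix_mojibake` is a hypothetical helper name, and the dictionary is a trimmed copy of the output shown earlier.

```python
def fix_mojibake(value):
    """Hypothetical helper: undo a UTF-8-read-as-Latin-1 mix-up."""
    try:
        # Turn each code point back into the byte it came from,
        # then decode those bytes with the codec they were written in.
        return value.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already correct): leave it alone.
        return value


tag = {
    'seo_a_title_text': u'\xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ',
    'src_value': '/logo.png',
    'seo_text': u'Logo of \xd0\xa8\xd1\x82\xd0\xb0\xd1\x82\xd1\x8b ',
    'page_id': 150,
}

text_type = type(u'')  # unicode on Python 2, str on Python 3
fixed = dict(
    (key, fix_mojibake(val) if isinstance(val, text_type) else val)
    for key, val in tag.items()
)

print(repr(fixed['seo_a_title_text']))
```

This only repairs the values after parsing. lxml's `HTMLParser` also accepts an `encoding` keyword (for example `etree.HTMLParser(encoding='utf-8')`), and telling the parser the charset up front may make the repair step unnecessary.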
Side note, the top of the file has this comment:
# -*- coding: utf-8 -*-
Not sure if this will affect the answers?