I'm completely confused by GAE. I have a script that makes a POST request (using urlfetch from the Google App Engine API) and gets a cp1251-encoded HTML page in response.

Then I decode it with .decode('cp1251') and parse it with lxml.

My code works totally fine on my local machine:

import re
import leaf #simple wrapper for lxml
weekdaysD={u'понедельник':1, u'вторник':2, u'среда':3, u'четверг':4, u'пятница':5, u'суббота':6}
document = leaf.parse(leaf.strip_symbols(leaf.strip_accents(html_in_cp1251.decode('cp1251'))))
table=document.get('table')
trs=table('tr') #leaf syntax
for tr in trs:
    tds=tr.xpath('td')
    for td in tds:
        if td.colspan=='3':
            curweek=re.findall(r'\w+(?=\-)', td.text)[0]
            curday=weekdaysD[td.text.split(u',')[0]]

But when I deploy it to GAE, I get:

curday=weekdaysD[td.text.split(u',')[0]]
KeyError: u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

How did non-Unicode characters get in there at all? And why does everything work locally? I've tried every combination of decode/encode calls in my code and nothing helped. I've been stuck for a few days now.

UPD: Also, if I add this to my script on GAE:

print type(weekdaysD.keys()[0]), type(td.text.split(u',')[0]) 

It returns 'unicode' for both. So I believe the HTML was decoded correctly. Could it be something with lxml on GAE?

stachern
  • Looks like you're somehow getting the page as UTF-8 instead of CP1251. It could be sniffing the user-agent, though I've never seen a site do that to determine the encoding it uses. – Thomas K Mar 20 '12 at 19:22
  • Nope, using .decode('utf-8') instead gives: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 286: invalid start byte – stachern Mar 20 '12 at 20:27
  • Well, if you take the string that gives you an error and do `s.encode('latin1').decode('utf-8')`, you get a correct string. So something is getting the encoding wrong - it could be lxml. – Thomas K Mar 20 '12 at 20:46
  • Does your file start with a PEP 263 source encoding marker? I'd guess your local installation is assuming a cp1251 encoding, while App Engine assumes ASCII. – Wooble Mar 20 '12 at 22:31
  • The source code itself is in UTF-8 and I have `# -*- coding: UTF-8 -*-` set. As for `s.encode('latin1').decode('utf-8')`: this could be a workaround; at least I no longer get the same error in that place. But I got some other errors, so I will check what I can do. – stachern Mar 21 '12 at 07:01
  • Those are unicode characters - hence the 'u' in front of the string - just not printable ones. It looks like your data may actually be in a 2-byte encoding of some sort, possibly UTF-16 or UCS-2? It's not really possible to say what the problem is, since you haven't provided enough detail. You need to log the contents of the string at various points, to determine where it's not what you expect it to be. – Nick Johnson Mar 21 '12 at 07:03
  • I'm pretty sure that input string is cp1251. I will collect more debugging info and add it to the question. – stachern Mar 21 '12 at 07:05
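
Following the suggestion above to log the string at several points, a minimal sketch might look like this. The helper name `describe` and the stage labels are hypothetical, and the sample bytes only stand in for the real response. Since `type()` reports `unicode` for both a correct and a mis-decoded string, the `repr` is the part that would show a difference.

```python
# -*- coding: utf-8 -*-
import logging


def describe(label, value, limit=80):
    """Return a one-line summary: the type name plus a repr of the first `limit` items."""
    return '%s: type=%s repr=%r' % (label, type(value).__name__, value[:limit])


# Assumed sample data standing in for result.content: the cp1251 bytes of one weekday.
raw = u'вторник'.encode('cp1251')
decoded = raw.decode('cp1251')

# Log the same summary at each stage so the two environments can be compared line by line.
for label, value in (('fetched', raw), ('decoded', decoded)):
    logging.debug(describe(label, value))
```

The same call could be repeated on the value that comes out of the parser (for example `td.text`), which is the stage the type checks in the question could not rule out.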

2 Answers

The string in the error message has type unicode, but its contents are actually the bytes of the UTF-8 encoding of вторник, each stored as a separate character. It would be helpful if you showed us the code that does the urlfetch call, since there is nothing wrong with the code you are showing.
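
To illustrate the point, here is a small sketch of one way such a string can come about: UTF-8 bytes decoded as Latin-1. This is only a reproduction of the symptom; it is not a claim about where the mix-up happens on App Engine.

```python
# -*- coding: utf-8 -*-
word = u'вторник'

# Encode to UTF-8 bytes, then (wrongly) decode those bytes as Latin-1:
# every byte turns into one code point in the range U+0000..U+00FF.
utf8_bytes = word.encode('utf-8')
mojibake = utf8_bytes.decode('latin-1')

# This is the same value as the key reported in the KeyError.
assert mojibake == u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'

# It is still a text (unicode) string, so a type check cannot tell the two apart,
# but it is twice as long and does not compare equal to the dictionary key.
print(len(word), len(mojibake), mojibake == word)
```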

Guido van Rossum
  • Here is the urlfetch code: `def get_schedule_week(self, message, data): result = urlfetch.fetch(url=url, payload=data, method=urlfetch.POST, headers={'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0'}) if result.status_code == 200: return result.content` The returned response is html_in_cp1251 as described in the question – stachern Mar 21 '12 at 19:19
  • Can you do me a favor and log result.content.__class__ somewhere both in the dev_appserver and in production? I'm curious if maybe it's str in dev_appserver but unicode in production. Another thing you might want to log would be repr(result.content) -- it should be identical in both cases. But maybe that's too long a string; perhaps repr(result.content[:500]) ? – Guido van Rossum Mar 22 '12 at 04:26
  • Sorry it took me so long. Here is the local output (everything works fine): `result.content.__class__: ` `repr(result.content): ' \r\n\r\n\r\n\t\r\n\t – stachern Mar 23 '12 at 20:04
  • And here is the one from App Engine (exactly the same): `result.content.__class__: ` `repr(result.content): ' \r\n\r\n\r\n\t\r\n\t – stachern Mar 23 '12 at 20:07
  • Could the problem be with lxml, then? I can't see any other reason for it not to work. – stachern Mar 23 '12 at 20:08
  • I also added `for x in weekdaysD.keys(): logging.debug('key:%s' % type(x)) logging.debug("leaf parse: %s" % type(leaf.strip_symbols(leaf.strip_accents(htmltext.decode('cp1251')))))` and it returns the same output on App Engine and locally: `DEBUG 2012-03-23 20:16:40,792 scheduleparser.py:23] key: DEBUG 2012-03-23 20:16:40,851 scheduleparser.py:24] leaf parse: ` and `logging.debug("tr_text: %s" % type(td.text.split(u',')[0]))` is also of type unicode in both places. – stachern Mar 23 '12 at 20:18
Well, the workaround of adding .encode('latin1').decode('utf-8', 'ignore') to the parsed text did the trick. I wish I could explain why it behaves this way.
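
For reference, a hedged sketch of how the workaround could be wrapped so the same code runs both locally and on App Engine. The helper name `repair_mojibake` is hypothetical, and it uses strict decoding rather than `'ignore'` so that a string that is already correct passes through untouched.

```python
# -*- coding: utf-8 -*-
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return `text` unchanged if it does not apply."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (e.g. already-correct Cyrillic),
        # or the resulting bytes are not valid UTF-8; in both cases leave it alone.
        return text


weekdaysD = {u'вторник': 2}

# The exact string from the KeyError in the question.
broken = u'\xd0\xb2\xd1\x82\xd0\xbe\xd1\x80\xd0\xbd\xd0\xb8\xd0\xba'
curday = weekdaysD[repair_mojibake(broken)]
```

In the parsing loop this would be applied to `td.text` before the split and the dictionary lookup.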

stachern