Spynner wrong encoding

Question

I'm trying to download this page, https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8 (this is how it looks for me in Russia: http://screencloud.net/v/6a7o), with spynner in Python. The page uses some JavaScript checks, so it can't simply be downloaded without full browser emulation.

My code:

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from StringIO import StringIO
import spynner


def log(text, filename_end):
    # Write the page source to its own file so it can be inspected later.
    filename = '/tmp/apple_log_%s.html' % filename_end
    print 'logged to %s' % filename
    with open(filename, 'w') as f:
        f.write(text)

debug_stream = StringIO()
browser = spynner.Browser(debug_level=3, debug_stream=debug_stream)

browser.load("https://itunes.apple.com/ru/app/farm-story/id367107953?mt=8")

ret = browser.contents
log(ret, 'noenc')

print 'content length = %s' % len(ret)
browser.close()
del browser

with open('/tmp/apple_log_debug', 'w') as f:
    f.write(debug_stream.getvalue())
print 'log stored in /tmp/apple_log_debug'

So the problem is that either Apple or spynner handles Cyrillic characters incorrectly. They look fine if I call browser.show() after loading, but in the code and logs they are still wrongly encoded, like <meta content="ÐÐ¾Ð»ÑÑÐ¸ÑÑ Farm Storyâ¢ Ð² App Store. ÐÑÐ¾ÑÐ¼Ð¾ÑÑÐµÑÑ ÑÐºÑÐ¸Ð½ÑÐ¾ÑÑ Ð¸ ÑÐµÐ¹ÑÐ¸Ð½Ð³Ð¸, Ð¿ÑÐ¾ÑÐ¸ÑÐ°ÑÑ Ð¾ÑÐ·ÑÐ²Ñ Ð¿Ð¾ÐºÑÐ¿Ð°ÑÐµÐ»ÐµÐ¹." property="og:description">.

http://2cyr.com/ says that this is UTF-8 text displayed as ISO-8859-1.
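To illustrate that diagnosis, here is a minimal sketch in plain Python, with no spynner involved, that reproduces the effect and reverses it. The helper names make_mojibake and repair_mojibake are made up for this example, and it assumes the garbled text still contains every original byte, including the invisible control characters; output that was copied and pasted may have lost some of them.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3; helper names are hypothetical.


def make_mojibake(text):
    """Encode text as UTF-8, then wrongly decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair_mojibake(garbled):
    """Undo the mistake: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


original = u"Получить Farm Story в App Store"
garbled = make_mojibake(original)
repaired = repair_mojibake(garbled)

print(repr(garbled))         # starts with u'\xd0', shown as "Ð"
print(repaired == original)  # the round trip restores the text
```

The repair step only works while the text is a one-to-one image of the original bytes, which is why it is better to decode correctly at the source than to patch the string afterwards.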

As you can see, I don't send any headers with my request, but if I take them from Chrome's network debug console and pass them to the load() method, e.g. headers=[('Accept-Encoding', 'utf-8'), ('Accept-Language', 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4')], I get the same result.

Also, the same network console shows that Chrome sends gzip,deflate,sdch as Accept-Encoding. I can try that too, but then I can't decode what I get: <html><head></head><body>ï¿½ï¿½}ksÇï¿½g!ï¿½ï¿½ï¿½4ï¿½I/zï¿½Oï¿½ï¿½ï¿½/)ï¿½(ywï¿½ï¿½ï¿½é®iï¿½ï¿½{ï¿½<vï¿½ï¿½ï¿½:ï¿½ï¿½Ù·ï¿½Ø³-?ï¿½bï¿½bï¿½ï¿½ jï¿½... even if I remove the tags from the beginning and end of the result.
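A note on why that output resists decoding: the sequence "ï¿½" is what U+FFFD, the Unicode replacement character, looks like when its UTF-8 bytes are read as ISO-8859-1. That suggests the compressed body was decoded as text somewhere before it reached the script, and such a decode is lossy. The sketch below uses only the standard library and a made-up page string. It does not assume spynner exposes the raw response bytes, and gzip_bytes and gunzip_bytes are hypothetical helper names.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3; helper names and page are hypothetical.
import gzip
import io
import zlib


def gzip_bytes(data):
    """Compress bytes into gzip format (stands in for a server response)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(data)
    return buf.getvalue()


def gunzip_bytes(raw):
    """Decompress a gzip body; 16 + MAX_WBITS makes zlib expect a gzip header."""
    return zlib.decompress(raw, 16 + zlib.MAX_WBITS)


page = u"<title>Получить Farm Story</title>"
raw_body = gzip_bytes(page.encode("utf-8"))

# Working on the raw bytes: decompress first, decode as UTF-8 second.
recovered = gunzip_bytes(raw_body).decode("utf-8")

# Decoding the compressed bytes as text first is lossy: invalid byte
# sequences become U+FFFD, so the original bytes cannot be rebuilt.
as_text = raw_body.decode("utf-8", "replace")
lossy_roundtrip = as_text.encode("utf-8")

print(recovered == page)            # decompression worked on raw bytes
print(lossy_roundtrip == raw_body)  # the text detour destroyed the data
```

If only the already-decoded string is available, as in the output quoted above, the compressed data is most likely unrecoverable, so not requesting gzip in the first place is the simpler route.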

Any help?

score 1 · Answer 1 · answered Jul 09 '15 at 15:02

Basically, browser.webframe.toHtml() returns a QString, so str() won't help if the result actually contains non-Latin Unicode characters. To get a Python unicode string, you need to do:

ret = unicode(browser.webframe.toHtml().toUtf8(), encoding="UTF-8")
# if you want to get rid of non-Latin text
ret = ret.encode("ascii", errors="replace")  # encodes to a bytestring

If you suspect the text is in Russian, you could instead encode it to a Russian single-byte code page (still a bytestring) by doing

ret = ret.encode("cp1251", errors="replace") # encodes to Win-1251
# or
ret = ret.encode("cp866", errors="replace")  # encodes to the DOS console code page

Only once the text is encoded to a bytestring can you write it to a plain file opened with open().
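As a rough illustration of what those encode() calls produce, here is a sketch that needs neither spynner nor Qt. The sample string and the file name page.html are made up, and io.open with an explicit encoding is shown as an alternative way to save the text, not as something this answer prescribes.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3; sample text and file name are made up.
import io
import os
import tempfile

text = u"Получить Farm Story"

# ASCII cannot represent Cyrillic, so each such character becomes "?".
ascii_bytes = text.encode("ascii", "replace")

# Both Russian code pages store every character of this string in one byte.
cp1251_bytes = text.encode("cp1251", "replace")
cp866_bytes = text.encode("cp866", "replace")

# Alternative: let io.open do the encoding and keep the file in UTF-8.
path = os.path.join(tempfile.mkdtemp(), "page.html")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(ascii_bytes)                      # question marks replace the Cyrillic
print(len(cp1251_bytes) == len(text))   # one byte per character
print(read_back == text)                # UTF-8 file round-trips losslessly
```

Writing UTF-8 keeps every character, whereas the ascii variant discards the Cyrillic text for good.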

score 0 · Answer 2 · answered Nov 02 '14 at 16:11

Calling str(browser.webframe.toHtml()) solved it for me.

scythargon


You should try to explain the solution a little bit more than just a single line of code for people who find this question later. – gloomy.penguin Nov 02 '14 at 16:35
