Download file over HTTPS query with Python headless browser

Question

I am scraping a website in Python (using spynner and BeautifulSoup). At some point I want to test a zip file download, triggered by the following HTTP query:

https://mywebsite.com/download?from=2011&to=2012

Pasted directly into a browser (Chrome), this triggers the download of a zip file with a given name. I have not been able to reproduce this behavior with my headless browser. I know it's not the right way to do it, but using something like this with spynner:

from spynner import Browser
b = Browser()
b.load(webpage,wait_callback=wait_page_load, tries=3)
b.load_jquery(True)
...
output = b.load("https://website.com/download?from=2011&to=2012")
print b.html
>> ...

does not work, of course (no zip file is downloaded). The last print statement shows that I end up on an error page with a Java exception stack trace.

Is there a way to:

- properly call the HTTP query without using the spynner load mechanism?
- capture the resulting zip file?
- download it with a chosen name?

Thanks for your help.

One last thing that came up after some testing in Chrome with the JavaScript debugger: I get the following warning when doing it in the browser:

Resource interpreted as Document but transferred with MIME type application/zip "https://mywebsite.com/download?from=2011&to=2012"

Edited:

I found out that the call actually made was:

https://mywebsite.com/download?from=10%2F18%2F2011&to=10%2F18%2F2012

This form works in a browser. In Python it has to be replaced by

https://mywebsite.com/download?from=10/18/2011&to=10/18/2012

because the percent-encoded form could not be used as is: the URL encoding would map %2F into %252F.
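
The double encoding can be reproduced with the standard library alone. The sketch below is Python 3 (the snippets in this thread are Python 2), and it uses `urllib.parse.quote` only as a stand-in for whatever encoding step spynner or Qt applies internally, which is an assumption. It just illustrates that encoding an already-encoded value turns `%2F` into `%252F`.

```python
from urllib.parse import quote, unquote

raw_date = "10/18/2011"                       # the value as typed by a user
encoded_once = quote(raw_date, safe="")       # "/" becomes "%2F"
encoded_twice = quote(encoded_once, safe="")  # "%" itself becomes "%25"

print(encoded_once)    # 10%2F18%2F2011
print(encoded_twice)   # 10%252F18%252F2011

# Decoding once only undoes the outer layer of encoding.
print(unquote(encoded_twice) == encoded_once)  # True
```

Passing the unencoded value and letting the library encode it exactly once avoids the problem.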

andrean · Accepted Answer · 2012-10-19T07:26:29.943

2

I'm not sure if this will handle your case, but give it a try:

import spynner
from PyQt4.QtCore import QByteArray

# `b` is the spynner Browser instance and `url` the download URL from the question.

def download_finished(reply):
    # Save the raw response body, then stop listening for further replies.
    try:
        with open('filename.ext', 'wb') as downloaded_file:
            downloaded_file.write(reply.readAll())
    except Exception:
        pass

    b.manager.finished.disconnect(download_finished)

download_url = spynner.QUrl(url)
request = spynner.QNetworkRequest(download_url)

request.setRawHeader('Accept', QByteArray(
    'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'))

b.manager.finished.connect(download_finished)
reply = b.manager.get(request)
b.wait_requests(1)  # pass an integer here, not the reply object
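
To tell whether the saved file is really a zip archive or an HTML error page, a quick check with the standard library may help. This is a minimal Python 3 sketch, independent of spynner; the helper name `describe_download` is made up for illustration.

```python
import zipfile

def describe_download(path):
    """Say whether `path` is a zip archive or something else (e.g. an error page)."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return "zip archive with %d member(s)" % len(archive.namelist())
    # Not a zip: show the first bytes, which usually reveal an HTML error page.
    with open(path, "rb") as handle:
        head = handle.read(60)
    return "not a zip archive, starts with: %r" % head
```

Calling `describe_download('filename.ext')` after the download finishes shows at a glance which of the two cases occurred.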

edited Oct 19 '12 at 07:26

answered Oct 19 '12 at 05:12


Hi @andrean. It did not work, `filename.ext` contains an error page basically. But once again, I don't have the path to the file I want to download. I have an html query that triggers the download. So my guess is that trying to load the html query as a page is an error. It is not meant to be loaded, it is meant to activate a function on the website that triggers the possibility to save a file. Does that make sense? – RockridgeKid Oct 19 '12 at 05:38
In that downloaded html file, do you see a new link maybe that will redirect to the actual download url? or is there a javascript snippet in it maybe which starts the download? – andrean Oct 19 '12 at 06:09
No it does not. However it contains a java exception stack and I wonder if this is javascript in the page that is not working or if it is the server that throws me back the error I created with my http request. – RockridgeKid Oct 19 '12 at 06:16
That would mean that the backend application actually crashed because of the request? and if it surely works in chrome, that's weird.. Does the page require Java Runtime Environment when you load it in a real browser? Because Java is disabled by default in QtWebKit – andrean Oct 19 '12 at 06:19
Ok, I just found that the query was different and included `%2F` instead of `/`. This was causing an error because `%2F` was mapped into `%252F` by the url encoding. I just updated my definition of the problem. So your solution is working except that it ends up waiting and not stopping. – RockridgeKid Oct 19 '12 at 07:18
sorry that's my mistake, pass an integer to b.wait_requests, not the reply object... – andrean Oct 19 '12 at 07:26
Which version of spynner do you use? I have an execution error `AttributeError: 'Browser' object has no attribute 'manager'` – RockridgeKid Oct 19 '12 at 07:38
I use spynner 1.10, check out the spynner.py file, and see where it assigns a QNetworkAccessManager instance to a Browser class attribute.. That's how I found the manager.. – andrean Oct 19 '12 at 07:51
In request.setRawHeader, shall I include `application/zip`? The zip files it creates are empty. As for the manager, I still get the error... – RockridgeKid Oct 22 '12 at 07:58
how did you test it if you get an error for the manager? an instance of the spynner Browser class has a manager attribute in all versions, I'm not sure how did you try to access it.. – andrean Oct 22 '12 at 08:52
now it works and I cannot reproduce the initial error. I think I will settle with that. – RockridgeKid Oct 23 '12 at 00:16

score 0 · Answer 2 · answered Oct 19 '12 at 00:17

You've made a mistake with spynner.

The script should look like this:

from spynner import Browser
b = Browser()
b.load(webpage,wait_callback=wait_page_load, tries=3)
b.load_jquery(True)
...
b.load("https://website.com/download?from=2011&to=2012")
# print b.html
# binary mode, since the target is meant to be a zip archive
with open("/tmp/foo.zip", "wb") as f:
    f.write(b.html)

See spynner doc

answered Oct 19 '12 at 00:17

Gilles Quénot


Thanks. You're right but it is not changing anything in the outcome, just printing the html content of an error loading page. – RockridgeKid Oct 19 '12 at 00:23
So you're wrong somewhere, but I can't help, I don't have real URL. – Gilles Quénot Oct 19 '12 at 00:26
Thanks @sputnick - I cannot share this URL, it's behind a login/password anyway. My issue is that if I copy the string `https...` and I paste it in chrome, it works (meaning it downloads the file). – RockridgeKid Oct 19 '12 at 00:44
You should enable debug mode & compare headers between real browser & spynner. – Gilles Quénot Oct 19 '12 at 01:07
I tried in another browser and this time a popup shows up to ask me to save the file. So the http query triggers this pop-up. I am going to explore the debug mode. Thanks @sputnick – RockridgeKid Oct 19 '12 at 04:35

r_31415 · Answer 3 · 2012-10-19T06:14:34.573

0

Does the following code work?

import os
import urllib
import urlparse

url = YOUR_URL

# Python 2: save the response under the last component of the URL path
opener = urllib.URLopener()
print 'downloading:', url
opener.retrieve(url, os.path.basename(urlparse.urlparse(url).path))
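
For comparison, here is a rough Python 3 sketch of the same idea that saves the response under a name of your choosing. It assumes the URL can be fetched without a login session, which is not the case for the site in the question, so cookies or authentication would still have to be added. The helper name `download_as` is made up for illustration.

```python
import shutil
from urllib.request import urlopen

def download_as(url, target_name):
    """Stream the response body for `url` into a local file named `target_name`."""
    with urlopen(url) as response, open(target_name, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target_name

# Hypothetical usage (the file name is an arbitrary choice):
# download_as("https://mywebsite.com/download?from=10%2F18%2F2011&to=10%2F18%2F2012",
#             "export.zip")
```

Streaming with `shutil.copyfileobj` avoids holding the whole archive in memory.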

edited Oct 19 '12 at 06:14

answered Oct 19 '12 at 00:18


Well, of course. I meant to show a usual situation. If you're getting HTML, then use only c.read() without json.loads() and try to collect all the links to the files you want to download in an array. Or is this only one zip file? – r_31415 Oct 19 '12 at 02:09
Thanks @robert. My issue is that I have a http query `https://website.com/download?from=2011&to=2012` that launches a download of a zipfile. So I don't have a html file, and I don't have any link giving me a path. I do not have any path. Otherwise retrieving would be easy for sure. – RockridgeKid Oct 19 '12 at 04:32
Right. I updated my answer and that should work, however, I think that what can be happening is that your url is not really launching the zip file but instead it's calling a mirror. If that is the case, then you need to fetch the source of that file. – r_31415 Oct 19 '12 at 06:16
