Downloading all links on a webpage using Mechanize in Python

Question

I was trying to follow the thread below, which seemed to answer my question. It is a good example of how to download all the links on a webpage using Mechanize:

Download all the links(related documents) on a webpage using Python

I followed the code that was posted:

import mechanize
from time import sleep
#Make a Browser (think of this as chrome or firefox etc)
br = mechanize.Browser()

#visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
#for more ways to set up your br browser object e.g. so it look like mozilla
#and if you need to fill out forms with passwords.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

f=open("source.html","w")
f.write(br.response().read()) #can be helpful for debugging maybe

filetypes=[".zip",".exe",".tar.gz"] #you will need to do some kind of pattern matching on your files
myfiles=[]
for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    for t in filetypes:
        if t in str(l): #check if this link has the file extension we want (you may choose to use reg expressions or something)
            myfiles.append(l)


def downloadlink(l):
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.

    br.click_link(l)
    f.write(br.response().read())
    print l.text," has been downloaded"
    #br.back()

for l in myfiles:
    sleep(1) #throttle so you dont hammer the site
    downloadlink(l)

I only changed:

f=open(l.text,"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

To:

f=open('C:\\l.text',"w") #perhaps you should open in a better way & ensure that file doesn't already exist.

That made the code run for me; otherwise it raised an error. When I run the code, I get the following output:

Download> xlwt-0.7.5.tar.gz has been downloaded 
xlwt-0.7.5.tar.gz has been downloaded

So it seemed to work, but I have no idea where the file was downloaded to. Any ideas? I have searched my C drive and could not find it.

If the code is run as:

f=open(l.text,"w")

It raises the following exception:

Traceback (most recent call last):
  File "C:\Python27\mech.py", line 33, in <module>
    downloadlink(l)
  File "C:\Python27\mech.py", line 25, in downloadlink
    f=open(l.text,"w") #perhaps you should ensure that file doesn't already exist.
IOError: [Errno 22] invalid mode ('w') or filename: 'Download> <span style="font-size: 75%">xlwt-0.7.5.tar.gz<span>'

Please check that I have re-indented your code correctly. As posted, it would not compile. — holdenweb, Jul 02 '14 at 00:27
Thank you for the correction. I made a mistake in editing to stackoverflow. My python script looks like the corrected code. — Code, Jul 02 '14 at 00:32
Also, please include the error message (in fact, the complete stack trace) to confirm exactly what exception was raised. — holdenweb, Jul 02 '14 at 00:40

holdenweb · Accepted Answer · 2014-07-02T00:39:24.197


The Python code you quoted uses the text attribute of the link l (hence the expression l.text) as the filename. Consequently (since each link should hopefully have a different text attribute value) the code should produce a number of files, one for each link.

Your change replaces a variable expression (one which has a different value for each link) with a constant string. So every download is being written to the same file, named l.text, in the C:\ directory. Consequently, when you look at this file you should see the contents of the last link on the page.
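
To illustrate the difference, here is a small sketch that uses made-up link texts and a temporary directory (both are assumptions for the demonstration, not part of the original script). Writing to a constant name overwrites the same file on every pass, while using the variable produces one file per link.

```python
import os
import tempfile

# Hypothetical link texts standing in for the .text of each mechanize link.
link_texts = ["first.zip", "second.tar.gz"]

workdir = tempfile.mkdtemp()  # scratch directory for the demonstration

# Constant filename: the string "l.text" is taken literally, so every
# iteration overwrites the same file.
for text in link_texts:
    with open(os.path.join(workdir, "l.text"), "w") as f:
        f.write(text)

# Variable filename: the value of `text` is used, giving one file per link.
for text in link_texts:
    with open(os.path.join(workdir, text), "w") as f:
        f.write(text)

created = sorted(os.listdir(workdir))
with open(os.path.join(workdir, "l.text")) as f:
    last_written = f.read()

print(created)       # the two per-link files plus the single "l.text"
print(last_written)  # only the text written on the final pass survives
```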

(By the way, not your fault I know, but l is a very bad name for a variable because it is easily confused with the digit one.)

The correct way to run this program is inside an empty directory (otherwise the individual files will be hard to track down) on which you have write permission. If any of the filenames contain slashes then you will have to take special pains to either create the necessary directory structure or transform them somehow into acceptable Windows filenames.
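
One possible way to get acceptable Windows filenames is to derive the name from the link's URL rather than from its text. The sketch below is only an illustration: the helper names filename_from_url and target_path, the fallback name "download", and the example URL are all hypothetical. It is written for Python 3; on Python 2, urlsplit lives in the urlparse module instead.

```python
import os
import posixpath
import re
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of url, made safe for Windows."""
    # Use only the path, ignoring any query string or "#fragment".
    name = posixpath.basename(urlsplit(url).path)
    # Replace characters that Windows does not allow in filenames.
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Fall back to a placeholder name (an assumption) if nothing is left.
    return name or "download"


def target_path(directory, url):
    """Join a download directory with the name derived from url."""
    return os.path.join(directory, filename_from_url(url))


# Hypothetical URL, used only to show the behaviour.
example = "http://example.com/files/xlwt-0.7.5.tar.gz#md5=abc"
print(filename_from_url(example))
```

Passing an explicit directory to target_path also removes the dependence on the current working directory.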

You may also wish to replace the detection code with something a little more idiomatic.

for l in br.links(): #you can also iterate through br.forms() to print forms on the page!
    s = l.url #test the URL itself; str(l) is the repr of the whole Link object, not just its URL
    if any(s.endswith(t) for t in filetypes):
        myfiles.append(l)
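
If the links carry a query string or a fragment after the filename, a suffix test on the whole string can miss them. A minimal sketch that tests only the path part of the URL follows. The function name has_wanted_extension and the example URLs are hypothetical, and the code is written for Python 3 (on Python 2, urlsplit lives in the urlparse module).

```python
from urllib.parse import urlsplit

filetypes = (".zip", ".exe", ".tar.gz")


def has_wanted_extension(url, extensions=filetypes):
    """Return True if the path part of url ends with one of the extensions."""
    # Drop any "?query" or "#fragment" and compare case-insensitively.
    path = urlsplit(url).path.lower()
    # str.endswith accepts a tuple and checks each suffix in turn.
    return path.endswith(tuple(extensions))


# Hypothetical URLs, used only to show the behaviour.
print(has_wanted_extension("http://example.com/pkg/xlwt-0.7.5.tar.gz#md5=0123"))
print(has_wanted_extension("http://example.com/docs.zip.html"))
```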

edited Jul 02 '14 at 00:39

answered Jul 02 '14 at 00:33


I think I have a little better understanding of it now. If I change the modified line back to f=open(l.text,"w"), my output gives the following error: IOError: [Errno 22] invalid mode ('w') or filename: 'Download> xlwt-0.7.5.tar.gz' – Code Jul 02 '14 at 00:39
I guess I am not sure how to point to the empty directory once I have created one on the C drive. – Code Jul 02 '14 at 00:42
In the shell window (the one, presumably, with a `C:>` prompt) use the `cd` command to change your process's current directory to the empty directory (the one you made, presumably, with a `mkdir` command at the same prompt). – holdenweb Jul 02 '14 at 00:52
That's very helpful. The code appears to make assumptions about the link content that aren't true for your data. This could be due to the inclusion of styling information which was absent in the original author's data. Are we to assume you would have liked that link to download to a file named `"xlwt-0.7.5.tar.gz"`? It shouldn't be difficult to remove the `<span>` tags, but I am wondering where the `"Download>"` portion comes from. – holdenweb Jul 02 '14 at 00:59
I got a little further by the following method: I saved the code in a script file called 'mech.py'. Then I created an empty folder 'empty' in C:\Python27\. After this I opened the command prompt, went into the empty folder, and called 'python C:\Python27\mech.py'. That actually created a file inside that empty folder called 'source.html'. However, my initial goal was to download all files of a certain format from a webpage into a directory. I thought this code would have done that? – Code Jul 02 '14 at 01:24
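
One likely reason the script does not fetch the files themselves: in mechanize, click_link only builds a request for the link, while follow_link actually opens it. A hedged sketch of a download loop along those lines is below. The function name download_links and its parameters are hypothetical; br is assumed to be a mechanize.Browser that has already opened the page, and links the filtered list of its Link objects. It is written for Python 3 (on Python 2, urlsplit lives in the urlparse module).

```python
import os
import posixpath
from urllib.parse import urlsplit


def download_links(br, links, directory):
    """Follow each link with the browser br and save the body in directory."""
    saved = []
    for link in links:
        # follow_link opens the link; click_link would only build a request.
        response = br.follow_link(link)
        # Name the file after the URL path; "download" is a placeholder fallback.
        name = posixpath.basename(urlsplit(link.absolute_url).path) or "download"
        path = os.path.join(directory, name)
        # Binary mode keeps archives intact on Windows.
        with open(path, "wb") as f:
            f.write(response.read())
        saved.append(path)
        # Go back to the listing page so the next link can be followed.
        br.back()
    return saved
```

With the script from the question, this could be called as download_links(br, myfiles, directory), where directory is an existing folder you have write permission on.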
