Download all the links(related documents) on a webpage using Python

Question

I have to download a lot of documents from a webpage. They are WMV files, PDFs, BMPs, etc., and all of them are linked from the page. At the moment, for each one I have to right-click the file, select 'Save Link As', and then save it as type 'All Files'. Is it possible to do this in Python? I searched Stack Overflow and found answers on how to get the links from a webpage, but I want to download the actual files. Thanks in advance. (This is not a homework question :)).

score 25 · Accepted Answer · edited May 23 '17 at 11:46

Here is an example of how you could download some chosen files from http://pypi.python.org/pypi/xlwt.

You will need to install mechanize first: http://wwwsearch.sourceforge.net/mechanize/download.html

import mechanize
from time import sleep
#Make a Browser (think of this as chrome or firefox etc)
br = mechanize.Browser()

#visit http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
#for more ways to set up your br browser object, e.g. so that it looks like Mozilla,
#and for filling out forms with passwords if you need to.

# Open your site
br.open('http://pypi.python.org/pypi/xlwt')

# Save the page source; this can be helpful for debugging.
# Binary mode, because br.response().read() returns raw bytes.
with open("source.html", "wb") as f:
    f.write(br.response().read())

filetypes = [".zip", ".exe", ".tar.gz"]  # you will need to do some kind of pattern matching on your files
myfiles = []
for l in br.links():  # you can also iterate through br.forms() to print forms on the page!
    # check whether the link URL contains a file extension we want
    # (you may choose to use regular expressions or something stricter)
    if any(t in l.url for t in filetypes):
        myfiles.append(l)


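def downloadlink(l):
    br.follow_link(l)  # opens the link; br.click_link(l) on its own only builds a Request
    # binary mode, since these are not text files; you should also
    # check that the file doesn't already exist before overwriting it
    with open(l.text, "wb") as f:
        f.write(br.response().read())
    print("%s has been downloaded" % l.text)
    br.back()  # return to the listing page so the next link can be followed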

for l in myfiles:
    sleep(1)  # throttle so you don't hammer the site
    downloadlink(l)

Note: br.click_link(l) only builds and returns a Request object; it does not open the link, so br.response() still refers to the page you were already on. br.follow_link(l) opens the link directly (it is equivalent to br.open(br.click_link(l))), so use it when you want the response to be the linked file. See Mechanize difference between br.click_link() and br.follow_link()
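
One possible refinement of the matching and file-naming steps above, offered as a sketch rather than as part of the original answer: it parses each link URL with the standard library so that a query string or a fragment such as #md5=... does not confuse the extension check, and it takes the local file name from the URL path instead of the link text. The helper names has_wanted_extension, filename_from_url and unique_path are hypothetical and are not part of mechanize.

```python
import os
from urllib.parse import urlparse, unquote

FILETYPES = (".zip", ".exe", ".tar.gz")  # same extensions as in the answer


def has_wanted_extension(url, filetypes=FILETYPES):
    """Return True if the path part of url ends with one of filetypes."""
    path = urlparse(url).path.lower()  # .path excludes ?query and #fragment
    return path.endswith(tuple(filetypes))


def filename_from_url(url, default="download.bin"):
    """Use the last path segment of url as the local file name."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name or default  # fall back when the URL ends with "/"


def unique_path(directory, name):
    """Return a path in directory that does not exist yet (name, name.1, ...)."""
    candidate = os.path.join(directory, name)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, "%s.%d" % (name, counter))
        counter += 1
    return candidate
```

With mechanize, a list comprehension such as [l for l in br.links() if has_wanted_extension(l.url)] could stand in for the nested loops, and unique_path(os.getcwd(), filename_from_url(l.url)) would give a path to open for writing without overwriting an earlier download.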

robert kink, I ran your code to download only zip files. The code runs with no errors, but I don't see the files in the Chrome download folder — newGIS, May 31 '16 at 11:02
Hmm, I think the file will get downloaded to the folder that you ran the Python script from. See http://stackoverflow.com/questions/5137497/find-current-directory-and-files-directory — Rusty Rob, May 31 '16 at 22:04
People could also consider pyppeteer? https://pypi.org/project/pyppeteer/ — Rusty Rob, Sep 13 '19 at 03:21
@newGIS I faced the same problem. Replacing br.click_link(l) with the following statement worked for me: br.retrieve(str(l.url), f'{l.text}.mp3')[0] — Yellowjacket11, Dec 29 '20 at 22:21

score 6 · Answer 2 · edited May 23 '17 at 12:09

Follow the Python code in this link: wget-vs-urlretrieve-of-python.
You can also do this very easily with Wget. Try the --level, --recursive and --accept command-line options in Wget. For example, to fetch only wmv and doc files, following links up to two levels deep: wget --accept wmv,doc --level 2 --recursive http://www.example.com/files/
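
For a pure-Python route along the lines of urlretrieve, here is a minimal sketch that uses only the standard library. It is not the code from the linked answer: the names LinkCollector, find_file_links and download_all are made up for illustration, and the example URL and extensions are simply the ones from the Wget command above.

```python
import os
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen, urlretrieve


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def find_file_links(html, base_url, extensions):
    """Return absolute links in html whose URL path ends with one of extensions."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [u for u in parser.links
            if urlparse(u).path.lower().endswith(tuple(extensions))]


def download_all(page_url, extensions, delay=1.0):
    """Fetch page_url and save every matching link into the current directory."""
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    for url in find_file_links(html, page_url, extensions):
        target = os.path.basename(urlparse(url).path)
        urlretrieve(url, target)  # writes the file to disk
        print(target, "has been downloaded")
        time.sleep(delay)  # throttle so the site is not hammered


# Example call (performs network requests, so it is left commented out):
# download_all("http://www.example.com/files/", (".wmv", ".doc"))
```

Unlike wget --recursive, this sketch only looks at links on the single page it is given; following links to deeper pages would need an extra loop over the HTML pages it finds.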

answered May 12 '11 at 07:28

gsbabil
