Unable to download files from a certain website

Question

I've written some Python code to download files from a webpage. Since I don't know how to download files from a site, I could only scrape the file links from it. If someone could help me download the files themselves, I would be very grateful. Thanks a lot in advance.

Link to that site: web_link

Here is my try:

from bs4 import BeautifulSoup
import requests

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text,"lxml")
for item in soup.select("#latest a"):
    print(item['href'])

Upon execution, the above script prints four different URLs, each pointing to one of those files.

score 2 · Answer 1 · answered Dec 13 '17 at 22:34

You can use requests.get:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/"
                        "viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    filename = item['href'].split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(item['href']).content)
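
One thing to watch with `item['href'].split('/')[-1]`: if a link carries a query string or ends in a slash, the result may not be a usable filename. Below is a minimal sketch of a more defensive helper that uses only the standard library. The name `filename_from_url` and the `download.bin` fallback are hypothetical, chosen for illustration, and nothing here is specific to the site in the question:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, fallback="download.bin"):
    """Return the last path component of url, ignoring query and fragment."""
    path = urlsplit(url).path                 # drops "?query" and "#fragment"
    name = posixpath.basename(unquote(path))  # decode %20 etc., keep last part
    return name or fallback                   # empty when the path ends in "/"
```

In the loop above, `filename = filename_from_url(item['href'])` could stand in for the `split` call.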

score 1 · Accepted Answer · answered Dec 13 '17 at 22:28

You can go with the standard library's urllib.request.urlretrieve(), but since you are already using requests, you can reuse the session here (download_file was largely taken from this answer):

from bs4 import BeautifulSoup
import requests


def download_file(session, url):
    local_filename = url.split('/')[-1]

    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

    return local_filename


with requests.Session() as session:
    response = session.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
    soup = BeautifulSoup(response.text,"lxml")
    for item in soup.select("#latest a"):
        local_filename = download_file(session, item['href'])
        print(f"Downloaded {local_filename}")
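
For comparison, the standard-library route mentioned at the top of this answer might look roughly like the sketch below. `download_with_urlretrieve` and its `dest_dir` parameter are hypothetical names introduced for illustration. Note that the Python documentation lists `urlretrieve` as a legacy interface, so the `requests` version above is probably the better fit when `requests` is already in use:

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def download_with_urlretrieve(url, dest_dir="."):
    """Save url into dest_dir and return the local path."""
    # Name the file after the last component of the URL path
    # (unlike url.split('/')[-1], this ignores any query string).
    local_filename = os.path.join(dest_dir, urlsplit(url).path.split("/")[-1])
    # urlretrieve returns a (filename, headers) tuple.
    path, _headers = urlretrieve(url, local_filename)
    return path
```

Inside the loop, `download_with_urlretrieve(item['href'])` would take the place of `download_file(session, item['href'])`, at the cost of not reusing the session's connection.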

Lucky to have you, sir alecxe. It has been a while. However, I'm facing a small issue: the script breaks when it reaches the `print` line. — SIM, Dec 13 '17 at 22:35
@Topto String literals with the `f` prefix, as used in the example, require Python 3.6 or later. On older versions you can use the plain `print("Downloaded", local_filename)` instead. — furas, Dec 13 '17 at 22:38
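
As a small sketch of the point made in the comment above, `str.format` builds the same message without needing f-string support. The value of `local_filename` here is a placeholder for illustration only:

```python
local_filename = "example.pdf"  # placeholder value for illustration

# f-string form, Python 3.6+ only:
#     print(f"Downloaded {local_filename}")
# Equivalent form that also runs on older Python 3 releases:
message = "Downloaded {}".format(local_filename)
print(message)
```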
