I've written some code in Python to download files from a webpage. Since I don't know how to download files from a site, I could only scrape the file links from it. If someone could help me download the files themselves, I would be very grateful. Thanks a lot in advance.

Link to that site: web_link

Here is my try:

from bs4 import BeautifulSoup
import requests

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text,"lxml")
for item in soup.select("#latest a"):
    print(item['href'])

Upon execution, the above script prints four different URLs pointing to those files.

SIM

2 Answers

You can use requests.get:

import requests
from bs4 import BeautifulSoup

response = requests.get("http://usda.mannlib.cornell.edu/MannUsda/"
                        "viewDocumentInfo.do?documentID=1194")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select("#latest a"):
    filename = item['href'].split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(requests.get(item['href']).content)
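
The filename step above takes everything after the last `/` in the link. As a minimal sketch, assuming some links might be relative or carry a query string (the question does not say either way), the standard library's urllib.parse can resolve the link and derive a cleaner filename. The helper names `resolve_link` and `filename_from_url` and the fallback name `download.bin` are hypothetical, chosen only for illustration:

```python
from urllib.parse import urljoin, urlsplit, unquote

PAGE_URL = ("http://usda.mannlib.cornell.edu/MannUsda/"
            "viewDocumentInfo.do?documentID=1194")


def resolve_link(page_url, href):
    """Turn a possibly relative href into an absolute URL (hypothetical helper)."""
    # urljoin leaves absolute hrefs unchanged and resolves relative ones
    # against the page they were found on.
    return urljoin(page_url, href)


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, without query or fragment."""
    # urlsplit separates the path from ?query and #fragment, so neither
    # ends up in the filename; unquote decodes escapes such as %20.
    name = unquote(urlsplit(url).path.rsplit("/", 1)[-1])
    # A URL ending in "/" has no final segment; fall back to an assumed default.
    return name or default
```

Inside the loop, `filename_from_url(resolve_link(PAGE_URL, item['href']))` could then stand in for `item['href'].split('/')[-1]`.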
etaloof

You can use the standard library's urllib.request.urlretrieve(), but since you are already using requests, you can reuse the session here (download_file was largely taken from this answer):

from bs4 import BeautifulSoup
import requests


def download_file(session, url):
    local_filename = url.split('/')[-1]

    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

    return local_filename


with requests.Session() as session:
    response = session.get("http://usda.mannlib.cornell.edu/MannUsda/viewDocumentInfo.do?documentID=1194")
    soup = BeautifulSoup(response.text,"lxml")
    for item in soup.select("#latest a"):
        local_filename = download_file(session, item['href'])
        print(f"Downloaded {local_filename}")
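
For comparison, the urllib.request.urlretrieve() route mentioned at the top of this answer might look roughly like the sketch below. The function name `download_with_urlretrieve` and its `directory` parameter are hypothetical, not part of the original answer, and the Python documentation describes urlretrieve as a legacy interface that might become deprecated:

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def download_with_urlretrieve(url, directory="."):
    """Save url into directory, named after the last URL path segment."""
    # Hypothetical helper: the name and the directory parameter are
    # illustrative assumptions, not taken from the answer above.
    local_filename = os.path.basename(urlsplit(url).path)
    local_path = os.path.join(directory, local_filename)
    # urlretrieve fetches the URL and writes the response body to local_path.
    urlretrieve(url, local_path)
    return local_path
```

In the loop above, `download_with_urlretrieve(item['href'])` would take the place of `download_file(session, item['href'])`, at the cost of opening a new connection per file instead of reusing the session.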
alecxe
  • Lucky to have you sir alecxe. It has been a while. However, a little issue I'm facing when it hits the `print` line. It breaks there. – SIM Dec 13 '17 at 22:35
  • @Topto you need Python 3.6 or later to use strings with the `f` prefix, as in the example, but you can use the older `print("Downloaded", local_filename)` instead – furas Dec 13 '17 at 22:38