
I am trying to download all PDF files containing scanned school books from a website. I tried using wget, but it doesn't work. I suspect this is because the website is an ASP page with dropdowns to select the course/year.

I also tried selecting a certain year/course and saving the HTML file locally, but this doesn't work either:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
from urllib.parse import urlparse
import wget

def get_pdfs(my_url):
    links = []
    html = urlopen(my_url).read()
    html_page = bs(html, features="lxml") 
    og_url = html_page.find("meta", property="og:url")
    base = urlparse(my_url)
    print("base", base)
    for link in html_page.find_all('a'):
        current_link = link.get('href')
        if current_link and current_link.endswith('pdf'):
            if og_url:
                print("currentLink",current_link)
                links.append(og_url["content"] + current_link)
            else:
                links.append(base.scheme + "://" + base.netloc + current_link)

    for link in links:
        try:
            wget.download(link)
        except Exception:
            print("\nUnable to download", link, "\n")


my_url = 'https://www.svpo.nl/curriculum.asp'
get_pdfs(my_url)

my_url_local_html = r'C:\test\en_2.html' # downloaded year 2 english books page locally to extract pdf links
get_pdfs(my_url_local_html)
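Note that `urlopen()` expects a URL, not a plain Windows path, so passing the local file to `get_pdfs` fails before any parsing happens. A minimal sketch of reading the saved page with `open()` instead (function name and behavior are illustrative, not from the question):

```python
# Sketch: parse a locally saved HTML page with open() instead of urlopen(),
# since urlopen() cannot take a bare filesystem path.
from bs4 import BeautifulSoup

def get_pdf_links_from_file(path):
    """Collect all hrefs ending in .pdf from a saved HTML file."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    return [a.get("href") for a in soup.find_all("a")
            if a.get("href") and a.get("href").endswith(".pdf")]
```

This still only finds the links that were present in the saved page; it does not solve the ASP form-selection problem addressed in the answer below.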

Snippet of my_url_local_html with links to PDFs:

            <li><a target="_blank" href="https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf">Chapter 5 - Going extreme</a></li>
        
            <li><a target="_blank" href="https://www.ib3.nl/curriculum/engels\020 TB 2 Ch 6.pdf">Chapter 6 - A matter of taste</a></li>

1 Answer


You need to send the form selection as a POST payload, for example `vak=Engels` and `klas_en_schoolsoort=2e klas`:

import requests
from bs4 import BeautifulSoup

url = "https://www.svpo.nl/curriculum.asp"
payload = 'vak=Engels&klas_en_schoolsoort=2e klas'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}

response = requests.post(url, data=payload, headers=headers)
for link in BeautifulSoup(response.text, "lxml").find_all('a'):
    current_link = link.get('href')
    if current_link and current_link.endswith('pdf'):
        print(current_link)

OUTPUT:

https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\020 TB 2 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\030 TB 2 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\040 TB 2 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\050 TB 2 Ch 9.pdf
https://www.ib3.nl/curriculum/engels\060 TB 2 Reading matters.pdf
https://www.ib3.nl/curriculum/engels\080 TB 3 Ch 1.pdf
https://www.ib3.nl/curriculum/engels\090 TB 3 Ch 2.pdf
https://www.ib3.nl/curriculum/engels\100 TB 3 Ch 3.pdf
https://www.ib3.nl/curriculum/engels\110 TB 3 Ch 4.pdf
https://www.ib3.nl/curriculum/engels\120 TB 3 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\130 TB 3 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\140 TB 3 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\150 TB 3 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\160 TB 3 Reading matters.pdf
https://www.ib3.nl/curriculum/engels\170 TB 3 Grammar.pdf
https://www.ib3.nl/curriculum/engels\Grammar Survey StSt 2.pdf
https://www.ib3.nl/curriculum/engels\StSt 2 Reading Matters.pdf
https://www.ib3.nl/curriculum/engels\StSt2 Yellow Pages.pdf
https://www.ib3.nl/curriculum/engels\050 WB 2 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\060 WB 2 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\070 WB 2 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\080 WB 2 Ch 8.pdf
https://www.ib3.nl/curriculum/engels\090 WB 2 Ch 9.pdf
https://www.ib3.nl/curriculum/engels\110 WB 3 Ch 1.pdf
https://www.ib3.nl/curriculum/engels\115 WB 3 Ch 2.pdf
https://www.ib3.nl/curriculum/engels\120 WB 3 Ch 3.pdf
https://www.ib3.nl/curriculum/engels\125 WB 3 Ch 4.pdf
https://www.ib3.nl/curriculum/engels\130 WB 3 Ch 5.pdf
https://www.ib3.nl/curriculum/engels\135 WB 3 Ch 6.pdf
https://www.ib3.nl/curriculum/engels\140 WB 3 Ch 7.pdf
https://www.ib3.nl/curriculum/engels\145 WB 3 Ch 8.pdf

UPDATE: To save a PDF from a link:

url = r'https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf'.replace(' ', '%20').replace('\\', '/')
response = requests.get(url)
with open('somefilename.pdf', 'wb') as f:
    f.write(response.content)
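Putting the pieces together, a sketch that downloads every link from the output above (the destination directory and helper names are illustrative; `urllib.parse.quote` covers any unsafe characters beyond plain spaces):

```python
import os
import requests
from urllib.parse import quote

def normalize_pdf_url(link):
    # Turn the site's backslash paths into forward slashes and
    # percent-encode spaces and other unsafe characters.
    return quote(link.replace("\\", "/"), safe=":/")

def download_all(links, dest_dir="pdfs"):
    """Fetch each PDF and save it under its basename in dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for link in links:
        url = normalize_pdf_url(link)
        filename = os.path.join(dest_dir, link.replace("\\", "/").rsplit("/", 1)[-1])
        response = requests.get(url)
        response.raise_for_status()
        with open(filename, "wb") as f:
            f.write(response.content)
```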

@KJ right, you need to replace the spaces and the backslash

Sergey K
  • Thanks - I am still struggling a bit with this. I think I should then pass this link to the function get_pdfs? Or can I use wget(my_url) directly? – user9118870 Sep 21 '22 at 13:46
  • @user9118870 I added code to save the PDF from the URL you get – Sergey K Sep 21 '22 at 14:12
  • thanks, that seems very simple. But I still get an error. I first tried to insert the link as it is returned and also tried it with the link that I get copying it in a browser. E.g. with open('https://www.ib3.nl/curriculum/engels/130%20WB%203%20Ch%205.pdf', 'wb') as f: f.write(response.get(url).content) I get an error: Invalid argument: 'https://www.ib3.nl/curriculum/engels/130%20WB%203%20Ch%205.pdf' Or with open('https://www.ib3.nl/curriculum/engels\145 WB 3 Ch 8.pdf', 'wb') as f: f.write(response.get(url).content) Same error – user9118870 Sep 21 '22 at 14:36
  • Thanks! It worked and I was able to download all with curl. One more question: the HTML contains a link to the PDF and the title as it is displayed on the webpage, e.g. "Chapter 5 - Going extreme". Is there a way to get the display text of a hyperlink? (So I can use it to name the PDF file when saving, rather than the file name it has on the website, which is not descriptive enough.) – user9118870 Sep 23 '22 at 10:08
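Regarding the last comment: BeautifulSoup exposes a link's display text via `.get_text()` (or `.text`), so it can be used as the file name. A minimal sketch using the snippet from the question:

```python
from bs4 import BeautifulSoup

html = r'<li><a target="_blank" href="https://www.ib3.nl/curriculum/engels\010 TB 2 Ch 5.pdf">Chapter 5 - Going extreme</a></li>'

soup = BeautifulSoup(html, "html.parser")
for a in soup.find_all("a"):
    title = a.get_text(strip=True)  # the display text of the hyperlink
    href = a.get("href")
    # e.g. save the download as f"{title}.pdf" instead of the server's name
    print(title, "->", href)
```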