
The scraper should download documents from a list of URLs that was scraped earlier.

I don't have any issue running it on my office network, but when I run the scraper at home on my home Wi-Fi, it keeps giving the same error.

I tried a suggestion from another post here, "Python HTTPConnectionPool Failed to establish a new connection: [Errno 11004] getaddrinfo failed", which was to pass a timeout argument.

However, it doesn't solve the problem.

I would appreciate an explanation along with a solution, as I'm not well-versed in network issues. Thank you.

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

doc_urls= [
'http://www.ha.org.hk/haho/ho/bssd/19d079Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/18S065Pg.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d080Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/NTECT6AT003Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/19D093Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d098Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d103Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/18G044Pe.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d104Pa.htm',
]

base_url = "http://www.ha.org.hk"

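out_dir = 'C:\\Users\\Desktop\\tender_documents\\'

# Reuse one session for every request so connections are pooled.
with requests.Session() as session:
    for doc_url in doc_urls:
        print('Visiting:', doc_url)
        # timeout is in seconds; it bounds waiting but does not affect DNS lookups
        r = session.get(doc_url, timeout=30)
        # get all document links
        links = BeautifulSoup(r.text, "html.parser").select("a[href]")
        # use a separate loop variable so the outer URL is not overwritten
        for link in links:
            href = link.attrs["href"]
            name = link.text
            print(f">>> Downloading file name: {name}, href: {href}")
            # open document page
            r = session.get(href, timeout=30)
            # get file path from the window.open('...', call in the page
            match = re.search(r"(?<=window\.open\(')(.*)(?=',)", r.text)
            if match is None:
                print(f">>> No file path found on {href}, skipping")
                continue
            file_path = match.group(0)
            file_name = file_path.split("/")[-1]
            # get file and save
            r = session.get(f"{base_url}/{file_path}", timeout=30)
            with open(os.path.join(out_dir, file_name), 'wb') as f:
                f.write(r.content)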

As mentioned, the scraper runs fine on my office network. It fails when I use my own Wi-Fi, and also my mother-in-law's Wi-Fi. My mother-in-law and I use the same Wi-Fi provider, if that helps.
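
Since `getaddrinfo failed` is raised when a hostname cannot be resolved to an IP address, one way to compare the two networks is to test name resolution on its own, without `requests` involved. Below is a minimal sketch using only the standard library; `can_resolve` is a hypothetical helper name chosen for illustration and is not part of the scraper above.

```python
import socket


def can_resolve(hostname, port=80):
    """Return True if the current network can resolve `hostname` to an address."""
    try:
        # Name lookup only: no HTTP request is sent to the site.
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # A failed lookup surfaces here, e.g. "[Errno 11004] getaddrinfo failed".
        print(f"{hostname}: lookup failed ({exc})")
        return False
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    print(f"{hostname}: resolves to {', '.join(addresses)}")
    return True
```

Calling `can_resolve("www.ha.org.hk")` once on the office network and once on the home Wi-Fi should show whether the failure happens at the lookup step. If it returns `False` only at home, the problem would be with name resolution on that network rather than with the scraper code, which would also be consistent with the timeout argument making no difference.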

Afiq Johari
