The scraper is supposed to download documents from a list of URLs that was scraped earlier.
It runs without issue on my office network, but when I run it at home over my home wifi, it keeps failing with the same error.
I tried a suggestion from another post here, passing a timeout to the request: Python HTTPConnectionPool Failed to establish a new connection: [Errno 11004] getaddrinfo failed.
However, it didn't solve the problem.
I would appreciate some explanation along with a solution, as I'm not well-versed in network issues. Thank you.
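From what I understand, error 11004 from getaddrinfo means the hostname could not be resolved to an IP address, i.e. the failure happens at the DNS lookup stage, before any HTTP request is made. Here is a minimal check (my own sketch, separate from the scraper) that should show whether www.ha.org.hk resolves on a given network:

import socket

host = "www.ha.org.hk"
try:
    # getaddrinfo is the same lookup that fails with errno 11004 in the traceback
    infos = socket.getaddrinfo(host, 80)
    print(f"{host} resolved to:", sorted({info[4][0] for info in infos}))
except socket.gaierror as e:
    # on the failing network this should reproduce the DNS error directly
    print(f"DNS lookup for {host} failed: {e}")

If this fails at home but works at the office, that would point to the network's DNS setup rather than the scraper code.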
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
doc_urls = [
    'http://www.ha.org.hk/haho/ho/bssd/19d079Pa.htm',
    'http://www.ha.org.hk/haho/ho/bssd/18S065Pg.htm',
    'http://www.ha.org.hk/haho/ho/bssd/19d080Pa.htm',
    'http://www.ha.org.hk/haho/ho/bssd/NTECT6AT003Pa.htm',
    'http://www.ha.org.hk/haho/ho/bssd/19D093Pa.htm',
    'http://www.ha.org.hk/haho/ho/bssd/19d098Pa.htm',
    'http://www.ha.org.hk/haho/ho/bssd/19d103Pa.htm',
    'http://www.ha.org.hk/haho/ho/bssd/18G044Pe.htm',
    'http://www.ha.org.hk/haho/ho/bssd/19d104Pa.htm',
]
base_url = "http://www.ha.org.hk"
for doc in doc_urls:
    with requests.Session() as session:
        r = session.get(doc)
        # get all document links on the page
        links = BeautifulSoup(r.text, "html.parser").select("a[href]")
        print('Visiting:', doc)
        for link in links:  # renamed from `doc` to avoid shadowing the outer loop variable
            href = link.attrs["href"]
            name = link.text
            print(f">>> Downloading file name: {name}, href: {href}")
            # open document page
            r = session.get(href)
            # extract the file path from the window.open(...) call in the page source
            file_path = re.search("(?<=window.open\\(')(.*)(?=',)", r.text).group(0)
            file_name = file_path.split("/")[-1]
            # get file and save
            r = session.get(f"{base_url}/{file_path}")
            with open('C:\\Users\\Desktop\\tender_documents\\' + file_name, 'wb') as f:
                f.write(r.content)
As mentioned, the scraper runs well on my office network. It fails when I run it over my own wifi, and also over my mother-in-law's wifi. My mother-in-law and I use the same wifi provider, if that helps.
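For reference, this is the kind of timeout handling I tried, extended with automatic retries via urllib3's Retry (the exact retry parameters here are my own guess, not from the other post). It is only a sketch of the approach, and as far as I can tell retries won't help if DNS lookups always fail on the network:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry transient failures with exponential backoff; parameters are a guess
retries = Retry(total=5, connect=5, backoff_factor=1,
                status_forcelist=[500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

# timeout applies per request; without it a bad connection can hang indefinitely
r = session.get("http://www.ha.org.hk/haho/ho/bssd/19d079Pa.htm", timeout=10)
print(r.status_code)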