I am trying to get the raw HTML code of thousands of URLs from the same website. My code looks as follows:
import urllib.request

list_links = ['...']  # a list of URLs of different articles on the same website
dict_link_html = {link: [urllib.request.urlopen(link).read()] for link in list_links}
The problem is that the request sometimes returns an empty string. Using the try/except statement below, I found out that the underlying error is an HTTP 509 error (Bandwidth Limit Exceeded):
import urllib.error
import urllib.request

for link in list_links:
    try:
        html = urllib.request.urlopen(link).read()
    except urllib.error.HTTPError as e:
        print(e, e.code, e.reason, e.headers)
My question is: how can I work around this bandwidth error? The code above worked fine for other websites, but this one appears to limit how fast it can be scraped. Adding a delay between subsequent requests with the time module (roughly as in the sketch below) helped, but some requests still fail with the 509 error.
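For context, this is roughly the delayed version I tried. It is only a minimal sketch: the 1-second pause, the 3 attempts, and the growing back-off after a 509 are arbitrary values I picked, not limits documented by the site.

import time
import urllib.error
import urllib.request

dict_link_html = {}
for link in list_links:
    for attempt in range(3):  # retry a few times if the request hits a 509
        try:
            dict_link_html[link] = urllib.request.urlopen(link).read()
            break  # success, stop retrying this link
        except urllib.error.HTTPError as e:
            if e.code == 509:
                # back off longer after each failed attempt
                time.sleep(5 * (attempt + 1))
            else:
                raise
    time.sleep(1)  # pause between links to stay under the bandwidth limit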
Thanks a lot for any help!