I am trying to get the raw HTML from thousands of URLs on the same website. My code looks as follows:

import urllib.request

list_links = ['...']  # a list of URLs of different articles from the same website

dict_link_html = {link: [urllib.request.urlopen(link).read()] for link in list_links}

The problem is that the request sometimes returns an empty string. Using the try/except statement below, I found that the underlying error is HTTP error 509.

import urllib.error

for link in list_links:
    try:
        html = urllib.request.urlopen(link).read()
    except urllib.error.HTTPError as e:
        print(e, e.code, e.reason, e.headers)  # e.code is 509 for the failing requests

My question is: how can I work around this bandwidth error? The code above worked for other websites, but this website seems to restrict bandwidth. Using the time module to delay subsequent requests helped, but some requests still return the 509 error.
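For context, here is a minimal sketch of the delay approach, assuming an arbitrary pause length (DELAY_SECONDS below is an illustrative value, not taken from my original code). As described above, this reduces but does not eliminate the 509 responses:

import time
import urllib.request
import urllib.error

DELAY_SECONDS = 2  # illustrative pause length; tune to the site's limits

dict_link_html = {}
for link in list_links:
    try:
        dict_link_html[link] = [urllib.request.urlopen(link).read()]
    except urllib.error.HTTPError as e:
        print(e, e.code, e.reason)  # still 509 on some links
    time.sleep(DELAY_SECONDS)  # pause before the next request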

Thanks a lot for any help!

Gerrito
  • Have you tried changing your user agent or faking the IP address? The website may be blacklisting you; however, it seems that error 509 is server-side and cannot be circumvented. – Game Developement Jul 04 '22 at 20:20
  • It's bandwidth, so go slower. You could add a sleep/retry when that happens. You say it's intermittent, so you likely aren't blacklisted. You may even find that a pause per X number of requests is in order. Generally, one should respect a website's bandwidth limitations. – tdelaney Jul 04 '22 at 20:24
  • You are right, going slower helps. It'll take some time to get the content, but I think this is the best way to do it. Thanks for your answers! – Gerrito Jul 04 '22 at 20:37
  • Just in case someone else faces a similar issue: using time and writing a retry function helped me out in this case! The code then is not super fast, but it does its job. Thanks for your help! – Gerrito Jul 05 '22 at 07:01
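For anyone who lands here, a minimal sketch of such a sleep-and-retry helper, assuming a hypothetical function name fetch_with_retry and illustrative retry/delay values (not the asker's actual code):

import time
import urllib.request
import urllib.error

def fetch_with_retry(url, max_retries=5, delay=5.0):
    # Hypothetical helper: retry only on HTTP 509, waiting longer each attempt.
    for attempt in range(max_retries):
        try:
            return urllib.request.urlopen(url).read()
        except urllib.error.HTTPError as e:
            if e.code != 509:
                raise  # other HTTP errors are not bandwidth-related, so re-raise them
            time.sleep(delay * (attempt + 1))  # back off a little longer each time
    raise RuntimeError("still receiving 509 after {} retries: {}".format(max_retries, url))

dict_link_html = {link: [fetch_with_retry(link)] for link in list_links}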
