
I have a web scraping program that downloads a web page a few times every hour. On about one out of 15 or 20 attempts I get:

[Errno 10054] An existing connection was forcibly closed by the remote host

or

[Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Is there a better approach than:

import time
import urllib2

def get_page(url):
    def get_page_once(url):
        # Return the page contents, or '' if the download fails.
        try:
            page = opener.open(url).read()
        except Exception as e:
            print('Failed to download %s: %s' % (url, e))
            page = ''
        return page

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0')]

    page = get_page_once(url)
    if page == '':
        # Retry once after a short pause.
        time.sleep(2)
        page = get_page_once(url)

    return page

I could do more than one retry, but I'm worried about spending too much time in this function.
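
To make the question concrete, here is a rough sketch of the kind of bounded retry I'm considering; the max_tries and delay values (and the backoff factor) are just placeholders, not something I've settled on:

import time
import urllib2

def get_page_with_retries(url, max_tries=3, delay=2):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0')]

    for attempt in range(max_tries):
        try:
            return opener.open(url).read()
        except Exception as e:
            print('Failed to download %s (attempt %d of %d): %s' % (url, attempt + 1, max_tries, e))
            if attempt < max_tries - 1:
                time.sleep(delay)
                delay *= 2  # back off: 2s, then 4s, so the worst case stays bounded
    return ''

With max_tries=3 and delay=2 the function sleeps at most about 6 seconds in total, plus whatever the connection timeouts cost, so the worst case is at least easy to reason about.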

foosion
    What do you mean by too much time? If it doesn't do what it's meant to do it doesn't matter how quick it is. – wdh Dec 18 '13 at 14:32
  • The program does lots of things. If it gets stuck retrying here, it has to put off everything else. – foosion Dec 18 '13 at 14:38

0 Answers