I am using gevent to download some HTML pages. Some websites are very slow, and some stop serving requests after a period of time, so I have to limit the total time for a group of requests. For that I use gevent's Timeout:
import gevent
from gevent import Timeout

timeout = Timeout(10)
timeout.start()

def downloadSite():
    # code to download the site's URLs one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print('Lost state here')
The problem is that I lose all state when the exception is raised.
Imagine I crawl the site 'www.test.com' and have managed to download 10 URLs just before the site admins take the web server down for maintenance. In that case I lose the information about the pages already crawled when the exception is raised.
The question is: how do I save the state and process the data even if the Timeout occurs?
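
To make the goal concrete, here is a minimal sketch of one possible way to keep partial results: the worker appends each finished page to a list owned by the caller, so whatever was downloaded before the Timeout is still available afterwards. The names `download_url`, `download_site`, `crawl`, and the `(url, delay)` job tuples are hypothetical stand-ins for illustration (the fake download just sleeps); only `gevent.spawn`, `gevent.sleep`, `Greenlet.join`, `Greenlet.kill`, and `Timeout` come from gevent itself.

```python
import gevent
from gevent import Timeout


def download_url(url, delay):
    # Hypothetical stand-in for a real fetch: sleep, then return fake HTML.
    gevent.sleep(delay)
    return "<html>%s</html>" % url


def download_site(jobs, results):
    # Append each page as soon as it finishes, so progress is stored in
    # the caller's list instead of the greenlet's local variables.
    for url, delay in jobs:
        results.append((url, download_url(url, delay)))


def crawl(jobs, seconds):
    results = []
    worker = gevent.spawn(download_site, jobs, results)
    timeout = Timeout(seconds)
    timeout.start()
    try:
        worker.join()
    except Timeout:
        # Time is up: stop the unfinished download, keep what we have.
        worker.kill()
    finally:
        # Make sure the timer cannot fire later in unrelated code.
        timeout.cancel()
    return results
```

Calling `crawl([("a", 0.01), ("b", 0.01), ("c", 5)], 0.5)` in this sketch should return the two pages that finished before the limit, while the slow third one is abandoned. The same idea would apply to any shared container (a dict keyed by URL, a queue), as long as each result is stored as soon as it arrives.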