4

I am using gevent to download some HTML pages. Some websites are way too slow, and some stop serving requests after a period of time. That is why I had to limit the total time for a group of requests I make. For that I use gevent's Timeout.

import gevent
from gevent import Timeout

timeout = Timeout(10)
timeout.start()

def downloadSite():
    # code to download the site's urls one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print 'Lost state here'

But the problem with it is that I lose all the state when the exception fires.

Imagine I crawl the site 'www.test.com'. I have managed to download 10 URLs right before the site admins decided to switch the webserver for maintenance. In such a case I will lose the information about the crawled pages when the exception fires.

The question is: how do I save state and process the data even if the Timeout happens?

Termos
  • Why don't you define one timeout per request? What is downloadUrl() actually doing? Is it blocking cooperatively? Can you provide a self-contained example? – Dr. Jan-Philip Gehrcke Jul 18 '13 at 13:02
  • Code is simplified. The downloadSite() function contains code to get the first page, find good internal links, download them, find more links, etc. I can't imagine how to wrap each request in a separate Timeout. IMHO it is wrong from a programming point of view, and it would also put a significant load on the website (imagine requesting 100 web pages from 'www.test.com' simultaneously) – Termos Jul 18 '13 at 13:08

2 Answers

3

Why not try something like:

import gevent
from gevent import Timeout

def downloadSite(url):
    # each request gets its own timeout, so one slow URL
    # does not abort the whole batch
    with Timeout(10):
        downloadUrl(url)

urls = ["url1", "url2", "url3"]

workers = []
limit = 5
counter = 0
for i in urls:
    # limit to 5 URL requests at a time
    if counter < limit:
        workers.append(gevent.spawn(downloadSite, i))
        counter += 1
    else:
        gevent.joinall(workers)
        workers = [gevent.spawn(downloadSite, i)]
        counter = 1
gevent.joinall(workers)

You could also save a status for every URL in a dict, or append the ones that fail to a separate list to retry later.
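
A minimal sketch of that idea, reusing the downloadUrl() placeholder from the question (the results dict and failed list are just illustrative names):

import gevent
from gevent import Timeout

results = {}   # url -> downloaded data; kept even if other URLs time out
failed = []    # urls that hit the timeout, remembered for a later retry

def downloadSite(url):
    try:
        with Timeout(10):
            results[url] = downloadUrl(url)
    except Timeout:
        failed.append(url)

urls = ["url1", "url2", "url3"]
gevent.joinall([gevent.spawn(downloadSite, u) for u in urls])
# everything gathered so far is in `results`; retry `failed` whenever you like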

Gabriel Samfira
  • Thank you, Gabriel, this works. I am a Python newbie and didn't know about the "with" construct :) – Termos Jul 19 '13 at 13:11
2

A self-contained example:

import gevent
from gevent import monkey
from gevent import Timeout

monkey.patch_all()
import urllib2

def get_source(url):
    req = urllib2.Request(url)
    data = None
    # The second argument (False) makes the Timeout silent: on expiry the
    # with-block is simply left and data stays None, instead of the
    # greenlet dying with a Timeout exception.
    with Timeout(2, False):
        response = urllib2.urlopen(req)
        data = response.read()
    return data

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

print contents[5]

It implements one timeout per request. In this example, contents contains the HTML source of google.com ten times, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.

I saw your last comment. Defining one timeout per request is definitely not wrong from a programming point of view. If you need to throttle traffic to the website, just don't spawn 100 greenlets simultaneously. Spawn 5 and wait until they have returned. Then you can possibly wait for a given amount of time and spawn the next 5 (as already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean that you would have to repeatedly call

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

where N should not be too high.
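
For instance, a minimal sketch of that batching loop on top of the get_source() above (the chunk size of 5 and the gevent.sleep() pause are just illustrative choices):

import gevent

def chunks(seq, size):
    # yield successive slices of `size` elements
    for i in xrange(0, len(seq), size):
        yield seq[i:i + size]

all_urls = ['http://google.com' for _ in xrange(20)]
contents = []
for batch in chunks(all_urls, 5):
    getlets = [gevent.spawn(get_source, url) for url in batch]
    gevent.joinall(getlets)
    contents.extend(g.get() for g in getlets)
    gevent.sleep(1)   # brief pause between batches to go easy on the server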

Dr. Jan-Philip Gehrcke
  • Thank you Jan-Philip! I have accepted Gabriel's answer simply because he was the first to mention the "with Timeout(10):" construct, although the code is basically similar. Thanks again) – Termos Jul 19 '13 at 13:13