I am writing a relatively simple crawler in Python, and I want to use an asynchronous networking library to fetch multiple pages concurrently. I saw the examples on the Eventlet page, but when I apply the same logic that works for ~200 web pages to ~1000-2000 URLs, performance slows down. (Most of the URLs are from different domains, and I have shuffled them.) What is the fastest way to crawl that many pages with Eventlet, and what speed (in fetches/s) can I expect?
Here is the example:
urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
"https://wiki.secondlife.com/w/images/secondlife.jpg",
"http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]
import eventlet
from eventlet.green import urllib2
def fetch(url):
return urllib2.urlopen(url).read()
pool = eventlet.GreenPool()
for body in pool.imap(fetch, urls):
print "got body", len(body)