
I am writing a relatively simple crawler in Python, and I want to use an asynchronous networking library to fetch multiple pages concurrently. I saw the examples on Eventlet's page, but when I apply the same logic that works for ~200 web pages to ~1000-2000 URLs, the performance slows down. (Most of the URLs are from different domains, and I have shuffled them.) What is the fastest way to crawl that many pages with Eventlet, and what speed can I get (in fetches/s)?

Here is the example:


urls = ["http://www.google.com/intl/en_ALL/images/logo.gif",
     "https://wiki.secondlife.com/w/images/secondlife.jpg",
     "http://us.i1.yimg.com/us.yimg.com/i/ww/beta/y3.gif"]

import eventlet
from eventlet.green import urllib2

def fetch(url):

  return urllib2.urlopen(url).read()

pool = eventlet.GreenPool()

for body in pool.imap(fetch, urls):
  print "got body", len(body)

1 Answer


We built a transformation proxy service on the Spawning web server, which uses Eventlet internally. The purpose of the service was to expose a legacy XML API to mobile applications (iPhone, Android, etc.)

http://pypi.python.org/pypi/Spawning/

1) The server calls an IIS-backed backend service, which outputs XML, using urllib

2) Python reads the XML and transforms it to JSON. lxml was used for parsing, and simplejson (with its native C extension compiled in) for output; see the sketch after this list

3) The resulting JSON was sent to the client
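
A rough sketch of step 2; the <item>/<name>/<value> element names are made up, since the legacy API's actual schema isn't shown here:

import simplejson
from lxml import etree

def xml_to_json(xml_bytes):
    # Parse the backend's XML response and reshape it into a JSON document.
    # The <item>/<name>/<value> layout is hypothetical.
    root = etree.fromstring(xml_bytes)
    items = [{"name": item.findtext("name"),
              "value": item.findtext("value")}
             for item in root.iter("item")]
    return simplejson.dumps({"items": items})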

The performance with Eventlet was awesome: > 1000 req/s on a server with 8 virtual cores. The performance was stable (zero error rate) with no latency problems. We had to balance the number of processes against the number of threads per process, and I think we ended up with something like 12 processes, each with 20-50 threads.

We also tested Twisted and its asynchronous page-fetching method. With Twisted, we managed to get only about 200 req/s before we started seeing too many errors. Twisted's latencies also started to grow quickly, which doomed that approach.

The performance was measured with complex JMeter scripts which did all the funky stuff, like authentication, etc.

I think the key here was how Spawning monkey-patches urllib so that it becomes asynchronous by nature.
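
With plain Eventlet, the equivalent effect looks roughly like this (whether Spawning does exactly this internally is the claim above, not something verified here):

import eventlet
eventlet.monkey_patch()   # replaces socket and friends with green, cooperative versions

import urllib2            # imported after patching, so its sockets are green

urls = ["http://www.google.com/intl/en_ALL/images/logo.gif"]  # sample URL from the question

pool = eventlet.GreenPool()
for body in pool.imap(lambda u: urllib2.urlopen(u).read(), urls):
    print "got", len(body), "bytes"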

Mikko Ohtamaa