
As mentioned in the Celery docs, the eventlet pool should be faster than the prefork pool for evented I/O such as asynchronous HTTP requests.

They even mention that:

"In an informal test with a feed hub system the Eventlet pool could fetch and process hundreds of feeds every second, while the prefork pool spent 14 seconds processing 100 feeds."

However, we are unable to reproduce anything like these results. Running the example tasks, `urlopen` and `crawl`, exactly as described and opening thousands of URLs, the prefork pool almost always performs better.

We tested with all sorts of concurrencies (prefork with concurrency 200; eventlet with concurrencies 200, 2000, and 5000). In all of these cases the tasks complete in fewer seconds using the prefork pool. The machine being used is a 2014 MacBook Pro with a RabbitMQ server running.
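
The workers were started along these lines (module name `tasks` as in the docs example):

celery -A tasks worker --pool=prefork --concurrency=200
celery -A tasks worker --pool=eventlet --concurrency=200  # also 2000 and 5000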

We are looking to make thousands of asynchronous HTTP requests at once and are wondering whether the eventlet pool is even worth implementing. If it is, what are we missing?

The result of `python -V && pip freeze` is:

Python 2.7.6
amqp==1.4.6
anyjson==0.3.3
billiard==3.3.0.20
bitarray==0.8.1
celery==3.1.18
dnspython==1.12.0
eventlet==0.17.3
greenlet==0.4.5
kombu==3.0.26
pybloom==1.1
pytz==2015.2
requests==2.6.2
wsgiref==0.1.2

Test code used (pretty much exactly from the docs):

>>> from tasks import urlopen
>>> from celery import group
>>> LIST_OF_URLS = ['http://127.0.0.1'] * 10000 # 127.0.0.1 was just a local web server, also used 'http://google.com' and others
>>> result = group(urlopen.s(url)
...                     for url in LIST_OF_URLS).apply_async()
  • Are you going to use OSX in production and if not, could you run same tests on target OS? Also, please, post output of `python -V && pip freeze`. – temoto Apr 30 '15 at 07:16
  • @temoto added the results of `python -V && pip freeze`! Will try on Ubuntu (target OS) – Tristan Apr 30 '15 at 15:11
  • 2
    Nothing suspicious so far. Could you also post commands used for speed testing? – temoto Apr 30 '15 at 15:29
  • We ran various tests that were all over the place, from hitting the Instagram API to pinging Google, and they all had similar results in terms of speed. I'm going to add a specific test we did above, though. – Tristan Apr 30 '15 at 22:42
  • 1
    Can you try to take Celery out of equation? I mean `eventlet.GreenPool` vs `multiprocessing.Pool`. – temoto May 02 '15 at 03:04
  • 1
    As of 2017 with Celery 4.0.2, I had the same problem. Tested on OSX too. `prefork` is around 50% better than `eventlet` in my case. Tasks do very simple Postgres queries and HTTP downloads/uploads. – fjsj Jul 06 '17 at 17:06
  • @fjsj seems like it's because you did **very simple** Postgres queries and HTTP downloads/uploads. The typical situation where `Eventlet` outperforms prefork is when you have many I/O-bound operations, such as high-latency HTTP requests. –  May 10 '18 at 02:10
  • @DerekKim so it's optimized for I/O but must be long latency I/O? – fjsj May 11 '18 at 03:09
  • 1
    @fjsj Yes. If your task doesn't spend much time doing I/O operations, it should be considered [CPU-bound](https://en.wikipedia.org/wiki/CPU-bound) task, not [I/O-bound](https://en.wikipedia.org/wiki/I/O_bound) task, and therefore, it's hard to show performance boost by utilizing eventlet. –  May 11 '18 at 04:37
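
A minimal sketch of the comparison temoto suggests, with Celery out of the picture (the target URL and pool sizes are placeholders; Python 2, to match the environment above):

# compare_pools.py
import time
import urllib2                                       # blocking version, for the process pool
from multiprocessing import Pool

import eventlet
from eventlet.green import urllib2 as green_urllib2  # cooperative version, for the green pool

URLS = ['http://127.0.0.1'] * 1000

def fetch(url):
    # top-level function so multiprocessing can pickle it
    return urllib2.urlopen(url).read()

def fetch_green(url):
    return green_urllib2.urlopen(url).read()

if __name__ == '__main__':
    start = time.time()
    green_pool = eventlet.GreenPool(200)
    list(green_pool.imap(fetch_green, URLS))
    print('eventlet.GreenPool(200): %.2fs' % (time.time() - start))

    start = time.time()
    proc_pool = Pool(processes=200)
    proc_pool.map(fetch, URLS)
    proc_pool.close()
    proc_pool.join()
    print('multiprocessing.Pool(200): %.2fs' % (time.time() - start))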

1 Answer


Eventlet allows you to have greater concurrency than prefork, even without writing non-blocking-style code. The typical situation where Eventlet outperforms prefork is when you have many blocking I/O-bound operations (e.g. `time.sleep` or `requests.get` to a high-latency website). It seems that your requests to localhost or 'http://google.com' get responses too quickly to be considered I/O-bound.

You can try this toy example to see how the Eventlet-based pool performs better at I/O-bound operations.

# in tasks.py add this function
import time

from celery import task  # assumed already imported, as in the docs' tasks.py example

# ...

@task()
def simulate_IO_bound():
    # block for five seconds to simulate a slow I/O call
    print("Do some IO-bound stuff..")
    time.sleep(5)

Run the workers the same way as before, then produce the tasks:

from tasks import simulate_IO_bound

NUM_REPEAT = 1000

# queue='my' assumes the workers were started consuming from a queue named 'my' (-Q my)
results = [simulate_IO_bound.apply_async(queue='my') for i in range(NUM_REPEAT)]
for result in results:
    result.get()

If you have one prefork worker with 100 subprocesses and another worker with 1000 green threads, you will see a dramatic difference.
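
For example, the two workers might be started like this (the queue name `my` matches the `apply_async` call above; the module name `tasks` is assumed):

celery -A tasks worker --pool=prefork --concurrency=100 -Q my
celery -A tasks worker --pool=eventlet --concurrency=1000 -Q my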

  • 1
    I posted this question 3 years ago, so hard to confirm on my end, but this answer makes a lot of sense! Thanks for the explanation. – Tristan May 12 '18 at 02:59