
I'm trying to make thousands of GET requests in the smallest amount of time possible. I need to do so in a scalable way: doubling the number of servers I use to make the requests should halve the time to complete for a fixed number of URLs.

I'm using Celery with the eventlet pool and RabbitMQ as the broker. I'm spawning one worker process on each worker server with --concurrency 100 and have a dedicated master server issuing tasks (the code below). I'm not getting the results I expect: the time to complete is not reduced at all when doubling the number of worker servers used.
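
For reference, each worker is started with a command along these lines (the application module name `tasks` is illustrative, not from my actual setup):

celery -A tasks worker --pool=eventlet --concurrency=100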

It appears as though as I add more worker servers, the utilization of each worker goes down (as reported by Flower). For example, with 2 workers, throughout execution the number of active threads per worker hovers in the 80 to 90 range (as expected, since concurrency is 100). However, with 6 workers, the number of active threads per worker hovers in the 10 to 20 range.

It's almost as if the queue size is too small, or the worker servers can't pull tasks off the queue fast enough to stay fully utilized, and adding more workers only makes it harder for each of them to pull tasks off the queue quickly.

urls = ["https://...", ..., "https://..."]
tasks = []
num = 0
for url in urls:
    num = num + 1
    tasks.append(fetch_url.s(num, url))

job = group(tasks)
start = time.time()
res = job.apply_async()
res.join()
print time.time() - start

Update: I have attached a graph of the succeeded tasks vs. time when using 1 worker server, 2 worker servers, etc. up to 5 worker servers. As you can see, the rate of task completion doubles going from 1 worker server to 2 worker servers, but as I add more servers, the rate of task completion begins to level off. (Graph omitted.)

monstermac77
  • How did you ensure the remote servers can sustain the increasing load? – temoto Feb 07 '18 at 08:34
  • Are you referring to the servers I'm hitting with my GET requests? – monstermac77 Feb 07 '18 at 09:18
  • The GET requests are actually hitting hundreds of different servers, each of which is definitely able to handle this load (they're designed to). I think there might be a bottleneck in adding tasks to the queue; essentially, I think adding more workers beyond 3 doesn't get a speed up because tasks are not added to the queue fast enough for all the workers to be fully utilized. Any ideas on how to speed up adding tasks, ideally with python 2.7 (maybe multithreading adding the tasks so I can just add more CPUs)? – monstermac77 Feb 07 '18 at 18:40
  • First, try to replace http request with `eventlet.sleep(0.2)`. Second, try to access target service via insecure http, a relevant bug was recently fixed in eventlet. Third, getting rid of rabbitmq, I hate to say it, is always a good idea, redis broker works better. And finally, I suggest getting rid of celery if you have to process each request individually. Otherwise, group requests and send to queue in small batches, this definitely will help against queue performance problem (if there is one). – temoto Feb 08 '18 at 14:09
  • @temoto, very good suggestions. 1. I've been looking at the task completion time in Flower and pretty much all of them come back in about 0.1 seconds, but that's a good idea for those who could be hitting their target servers too hard. 2. Unfortunately, all of these target servers redirect to https. 3. Since making the post I did switch to redis and you're absolutely right: it is faster. 4. Have looked into dropping down to kombu, but your suggestion to group requests was brilliant. It does seem like the bottleneck was adding tasks to the queue, because using chunking in Celery fixed this. – monstermac77 Feb 11 '18 at 20:14
  • @temoto, the only issue with chunking is that it seems like a worker will not work on the tasks within a chunk in parallel, only serially. That is, say worker W gets chunk 1, containing tasks (A, B, C). It seems like worker W is doing task A, then waiting for task A to complete before starting task B, etc. Is this expected behavior? I could be wrong. It could be that tasks A, B, and C are executed in parallel, but Flower misleadingly shows the task completion time for this chunk as the sum of the individual task completion times, even though the tasks had executed in parallel. – monstermac77 Feb 11 '18 at 20:30
  • on task concurrency, see extended details in response and then try HTTP (not https) to dummy server. If the issue happens to be due to https, try workaround from here https://github.com/eventlet/eventlet/issues/457 – temoto Feb 12 '18 at 21:18
  • Thanks. I was able to confirm from some tests that Celery's built-in chunking mechanism does not parallelize tasks within a chunk, so using it does cut down on how quickly you can add tasks to the queue, but at the cost of parallelism within each chunk (a sketch of that mechanism follows these comments). More discussion on this below. – monstermac77 Feb 25 '18 at 21:40
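
For context on the built-in mechanism discussed above: Celery can split an iterable of argument tuples into chunk tasks with Task.chunks. A minimal sketch, assuming the fetch_url task and urls list from the question:

# each element is the argument tuple for one fetch_url(num, url) call
args = [(num, url) for num, url in enumerate(urls, 1)]

# split into chunk tasks of 10 calls each; the 10 calls inside a chunk
# are executed serially within a single worker slot
res = fetch_url.chunks(args, 10).apply_async()
res.get()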

1 Answer


For future readers. Actions that helped, most significant benefit first:

  • Group several small work units into one celery task
  • Switch Celery broker from RabbitMQ to Redis (a minimal broker config is sketched below)
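
A minimal sketch of pointing the Celery app at a Redis broker (host, port, and db numbers are placeholders):

from celery import Celery

# redis://host:port/db -- adjust to your own Redis instance
app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/1')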

More potentially useful hints. These were not mentioned in the original comment discussion, so how much they would help in this particular case is unknown.

  • Use httplib2 or urllib3 or another, better HTTP library; requests burns CPU for no good reason.
  • Use an HTTP connection pool. Check and make sure you reuse persistent connections to the target servers (a urllib3 sketch is below).
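
A minimal sketch of connection pooling with urllib3 (the pool sizes are illustrative); a PoolManager keeps persistent connections per host and reuses them across requests:

import urllib3

# one PoolManager per process; it keeps up to maxsize persistent
# connections per target host
http = urllib3.PoolManager(num_pools=50, maxsize=20)

def http_fetch(url):
    # connections to the same host are reused automatically
    return http.request('GET', url)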

Chunking explained.

Before chunking

urls = [...]

@app.task
def task(url):
    response = http_fetch(url)   # http_fetch and process are placeholders
    return process(response.body)

for url in urls:
    task.apply_async((url,))

So the task queue contains N = len(urls) tasks; each task fetches a single url and performs some calculations on the response.

With chunking

import eventlet

def chunk(xs, n):
    # yield successive n-sized slices of xs
    for i in range(0, len(xs), n):
        yield xs[i:i + n]

chunks = list(chunk(urls, 3))  # [[url1, url2, url3], [url4, url5, url6], ...]

@app.task
def task(urls_chunk):
    # fetch every url in the chunk concurrently inside this one task
    pool = eventlet.GreenPool()
    result = {
        response.url: process(response)
        for response in pool.imap(http_fetch, urls_chunk)
    }
    return result

for c in chunks:
    task.apply_async((c,))

Now the task queue contains M = len(urls) / chunksize tasks; each task fetches chunksize urls and processes all of the responses. You now have to multiplex the concurrent url fetches inside a single chunk yourself; here it's done with an Eventlet GreenPool.

Note: because of Python, it is likely beneficial to first perform all the network IO and only then perform all the CPU calculations on the responses in a chunk, amortizing the CPU load across multiple celery workers.
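
A minimal sketch of that two-phase split, as a variant of the chunk task above (http_fetch and process remain placeholders):

@app.task
def task(urls_chunk):
    pool = eventlet.GreenPool()
    # phase 1: all network IO runs concurrently on green threads
    responses = list(pool.imap(http_fetch, urls_chunk))
    # phase 2: CPU-bound processing runs after the IO is done; the CPU
    # load is amortized because different workers handle different chunks
    return {r.url: process(r) for r in responses}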

All code in this answer shows the general direction only. You should implement a better version, with less copying and fewer allocations.

temoto
  • This is great! Interestingly, I implemented this custom chunking solution that you suggested (to get tasks within a chunk to execute in parallel) and was only able to get a speed up of 7% on about 12,000 URLs. I think this is because parallelizing within a chunk only really helps you with overall performance in the worst case scenario, i.e. when you happen to put multiple slow URLs in the same chunk. So unless your program's overall execution is consistently being slowed down by a couple of stragglers in the same chunk, parallelizing within a chunk will yield little benefit. – monstermac77 Feb 25 '18 at 21:58
  • Regarding broker choice: as you suggest, Redis is able to consistently perform 30% faster than RabbitMQ. That is, the time from adding tasks to the queue to when .get() returns is 30% faster when using Redis vs. RabbitMQ. Any specific ideas on why this might be, or is this just expected because RabbitMQ has more overhead? Turning off durability has had no measurable effect on performance. This is a significant speed-up, so I'd like to use Redis, but I know RabbitMQ is more battle tested for production and I've had stability issues that may have been due to Redis (unsure, still testing this). – monstermac77 Feb 25 '18 at 22:12
  • RabbitMQ is a message queue application and Redis (in this scenario) is a linked list over a socket, so of course Rabbit is bound to have extra overhead. Part of that overhead has to be implemented in "client side" broker code (kombu or something) and, unsurprisingly, Python is slower than (anything) Erlang. Specific idea (sorry, not the one you asked for): measure overall system performance and decide whether you are happy with it. I know it's a tough decision, but otherwise you either pay too much or fall into a bottomless optimization pit. – temoto Feb 26 '18 at 23:05
  • I have now confirmed that Redis is consistently failing for me after running perfectly fine for several hours. Since you seem familiar with Redis, perhaps there's something obvious to you that's causing .get() in Celery to hang indefinitely: https://stackoverflow.com/q/49006182/2611730. If I can't get Redis to work, I'll have to settle for the significantly slower (but hopefully more stable) RabbitMQ. I'm simply running out of things I can think of to try to get Redis working reliably. – monstermac77 Feb 28 '18 at 03:37
  • So I ended up trying RabbitMQ again. For some weird reason, after spinning up new servers (no code changes), RabbitMQ was performing twice as fast as it was previously. On the old servers, it took RabbitMQ 70s to complete the tasks and it took Redis about 55s, and with the new servers it went down to 35s and 30s respectively. Any ideas on why this could have happened? Both sets of servers were all in the same data center. – monstermac77 Mar 10 '18 at 02:55
  • @monstermac77 sorry mate, that is broad guessing now. They may use different/fixed hardware (without you knowing) or a different network connection; there is not enough information to give a definite reason. – temoto Mar 11 '18 at 10:10
  • thanks, yeah, I know I'm not giving you much to work with. I ended up testing a whole bunch of other conditions after that and was not able to reproduce the much slower performance (tried different server configs (less CPU/less RAM), tried moving half of the workers to a different data center in the same city, tried moving them across the country, etc.). So my guess is that it was some sort of transient hardware thing with our provider or possibly RabbitMQ had built up some nasty mix of connections during our testing that was causing a slowdown? Not sure if that's possible. – monstermac77 Mar 14 '18 at 17:44