
I am using Celery to distribute tasks to multiple servers. For some reason, adding 7,000 tasks to the queue is incredibly slow and appears to be CPU bound. It takes 12 seconds to execute the code below, which is just adding tasks to the queue.

import time  # (celery app and fetch_url imports omitted)

start = time.time()
for url in urls:
    fetch_url.apply_async((url.strip(),), queue='fetch_url_queue')
print(time.time() - start)  # ~12 seconds for 7,000 URLs
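
For illustration, fetch_url is an ordinary Celery task that takes a single URL; a minimal version might look like this (the broker URL and the requests call are placeholders, not my exact code):

from celery import Celery
import requests

app = Celery('tasks', broker='pyamqp://guest@broker-host//')  # placeholder broker URL

@app.task
def fetch_url(url):
    # The task body is not the problem; the 12 seconds are spent enqueueing.
    return requests.get(url).status_code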

Switching brokers (I have tried Redis, RabbitMQ, and pyamqp) does not have any significant effect.

Reducing the number of workers (each of which runs on its own server, separate from the master server that adds the tasks) does not have any significant effect.

The URLs being passed are very small, each just about 80 characters.

The latency between any two given servers in my configuration is sub-millisecond (<1ms).

I must be doing something wrong. Surely Celery should be able to add 7,000 tasks to the queue in less than several seconds.

monstermac77
  • How long would you expect it to take to add 7,000 of anything? It seems that it would be unreasonable to expect that to be instantaneous. – theMayer Feb 07 '18 at 23:11
  • I was by no means expecting instantaneity, but given the small amount of data being passed with each task (an 80 character URL), I was expecting something on the order of 1 second. – monstermac77 Feb 08 '18 at 07:24
  • How are your queues configured? Are they set up as persistent? – theMayer Feb 08 '18 at 13:37
  • No, to try to boost performance I made my queue transient (set `durable=False` and `delivery_mode=1`); a sketch of that setup is below. – monstermac77 Feb 10 '18 at 07:34
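
For reference, the transient setup mentioned in the last comment looks roughly like this (assuming Celery 4.x-style setting names; the broker URL is a placeholder):

from celery import Celery
from kombu import Queue

app = Celery('tasks', broker='pyamqp://guest@broker-host//')  # placeholder broker URL

# Non-durable queue and transient (non-persisted) messages, i.e. delivery_mode=1
app.conf.task_queues = [Queue('fetch_url_queue', durable=False)]
app.conf.task_default_delivery_mode = 'transient'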

1 Answer


The rate at which tasks can be queued depends on the Celery broker you are using and your server's CPU.

With an AMD A4-5000 CPU and 4 GB of RAM, here are the task rates for various brokers:

# memory -> 400 tasks/sec
# rabbitmq -> 220 tasks/sec
# postgres -> 30 tasks/sec

With an Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz and 4 GB of RAM:

# memory -> 2000 tasks/sec
# rabbitmq -> 1500 tasks/sec
# postgres -> 200 tasks/sec
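
To reproduce this kind of measurement on your own hardware, a rough timing loop like the one below works (noop is a placeholder task; point the broker URL at whichever broker you want to test):

import time
from celery import Celery

app = Celery('bench', broker='pyamqp://guest@localhost//')  # broker being measured

@app.task
def noop():
    pass

N = 7000
start = time.time()
for _ in range(N):
    noop.apply_async()
elapsed = time.time() - start
print('%d tasks in %.2fs -> %.0f tasks/sec' % (N, elapsed, N / elapsed))
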
Chillar Anand
  • Looks like you're right. I just really didn't expect adding 7,000 tasks to take multiple seconds, even in the best case. Do you know how I could call apply_async() in multiple threads to halve the time by adding another CPU? Ideally in python 2.7. – monstermac77 Feb 07 '18 at 18:20
  • 1500 tasks/sec is probably a best-case scenario. It's going to depend heavily on the content of the message, as RabbitMQ has batch processing (as opposed to stream) of the individual messages. This would be an upper bound, and would not be increased by multiple producers generating messages asynchronously. You'd also have to contend with any [auto-throttling](https://www.rabbitmq.com/blog/2015/10/06/new-credit-flow-settings-on-rabbitmq-3-5-5/). – theMayer Feb 07 '18 at 23:14
  • Thanks very much @theMayer, I didn't realize all of these limitations. In order to speed up adding tasks to the queue, it looks like chunking has been able to reduce the time. I don't suppose there is much downside to this approach as a way of getting tasks into the queue faster? http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks – monstermac77 Feb 08 '18 at 07:20
  • @monstermac77 If you have all the arguments for your tasks beforehand, chunking is a good option (see the sketch after these comments). – Chillar Anand Feb 08 '18 at 12:01
  • @ChillarAnand it seems like chunking has been working great at reducing the time it takes to add tasks to the queue, but it seems like a worker will not work on the tasks within a chunk in parallel, only serially. That is, say worker W gets chunk 1, containing tasks (A, B, C). It seems like worker W is doing task A, then waiting for task A to complete before starting task B, etc. Is this expected behavior and if so is there a way to make it so that tasks within a chunk can be worked on in parallel? – monstermac77 Feb 11 '18 at 19:45
  • It is expected behavior. Do you have a single worker or a cluster of workers to run tasks? – Chillar Anand Feb 12 '18 at 05:32
  • Got it, so to use both chunking and to parallelize tasks within a chunk, I need to create my own chunking mechanism (since using Celery's `.chunks()` function does not appear to allow parallelization of tasks within a chunk)? That is, say I have `fetch_url.apply_async((url,))` right now, I would need to change that to `fetch_url.apply_async(([url1, url2, url3],))` and then within `fetch_url` I would need to write my own code to make requests for `url1`, `url2`, and `url3` in parallel? And to answer your question: I have a cluster of workers running the tasks (each worker has its own server). – monstermac77 Feb 12 '18 at 23:39
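
A minimal sketch of the chunking approach discussed in these comments (it assumes the fetch_url task and urls list from the question; the chunk size of 100 is arbitrary). Turning the chunks into a group lets different chunks go to different workers, but the calls inside one chunk still run serially on a single worker:

# assumes fetch_url and urls from the question are already defined/imported
args = [(url.strip(),) for url in urls]  # chunks() expects an iterable of argument tuples

# One broker message per chunk of 100 URLs instead of one message per URL,
# which is what cuts the enqueue time.
fetch_url.chunks(args, 100).group().apply_async(queue='fetch_url_queue')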