
I'm having a major problem in my Celery + RabbitMQ app: queuing up my jobs takes longer than it takes my workers to perform them. No matter how many machines I spin up, my queuing time will always overtake my task time.

This is because I have one celery_client script on one machine doing all the queuing (calling task.delay()) sequentially, iterating through a list of files stored in S3. How can I parallelize the queuing process? I imagine this is a widespread, basic problem, yet I cannot find a solution.

EDIT: to clarify, I am calling task.delay() inside a for loop that iterates through a list of S3 files (a huge number of small files). I need to get the results back so I can return them to the client, so after queuing I iterate through the list of results to check whether each one has completed -- if it has, I append it to a result file.
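Roughly, my current sequential code looks like this (simplified sketch; process_file stands in for my actual task and list_s3_files for my S3 listing code):

```python
results = []
for key in list_s3_files():                    # placeholder for my S3 listing
    results.append(process_file.delay(key))   # one publish per file, sequential

# Poll the results and append completed ones to a result file.
pending = list(results)
with open("results.txt", "a") as out:
    while pending:
        for res in list(pending):
            if res.ready():                    # has this task finished?
                out.write(str(res.result) + "\n")
                pending.remove(res)
```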

One solution I can think of immediately is adding some kind of multithreading to my for loop, but I am not sure whether .delay() would work with this. Is there no built-in Celery support for this problem?
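For example, something like this thread-pool sketch is what I have in mind (process_file and list_s3_files are placeholders, and I don't know whether calling .delay() concurrently like this is safe or idiomatic):

```python
from concurrent.futures import ThreadPoolExecutor

# Each .delay() is one small broker publish, so the hope is that N
# threads give roughly N times the queuing throughput.
def enqueue(key):
    return process_file.delay(key)

with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(enqueue, list_s3_files()))
```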

EDIT2: more details -- I am using a single queue in my celeryconfig, and my tasks are all the same.

EDIT3: I came across "chunking", where you can group many small tasks into one big one. Not sure whether this helps with my problem: although I can transform a large number of small tasks into a small number of big ones, my for loop is still sequential. I could not find much information in the docs.
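From what I can tell from the canvas docs, the usage would be something like this (assuming process_file(key) is my task and file_keys is my list of S3 keys):

```python
# Split the calls into chunks of 100; each chunk is sent as a single
# task message. Note the enqueuing itself still happens in one thread.
res = process_file.chunks([(k,) for k in file_keys], 100)()
print(res.get())   # a list of lists: one result list per chunk
```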

jeffrey
  • I think what you were looking for is groups. Take a look at [this answer](https://stackoverflow.com/a/33259298/1150701) – greggmi Jul 25 '17 at 01:04
  • [Here is the documentation on groups](https://celery.readthedocs.io/en/latest/getting-started/next-steps.html#groups) – greggmi Jul 25 '17 at 01:29
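
A minimal sketch of the groups approach suggested in the comments above (process_file and file_keys are placeholders for the actual task and S3 key list):

```python
from celery import group

# Build one signature per file and dispatch them all as a single group.
job = group(process_file.s(key) for key in file_keys)
result = job.apply_async()
print(result.get())   # results come back in the same order as file_keys
```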

1 Answer


If queuing up your tasks takes longer than the tasks themselves, increase the scope of each task so it operates on N files at a time. Instead of queuing up 1000 tasks for 1000 files, queue up 10 tasks that each operate on 100 files.

Make your task take a list of files as input, rather than a single file. Then, when you loop through your list of files, you can step through it 100 at a time, as in the sketch below.
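An untested sketch of what I mean (process_files, handle_one, and list_s3_files are placeholders for your task, per-file logic, and file listing; app is your Celery app):

```python
BATCH_SIZE = 100

@app.task
def process_files(keys):
    # Operate on a whole batch of files inside one task.
    return [handle_one(key) for key in keys]

file_keys = list_s3_files()
for i in range(0, len(file_keys), BATCH_SIZE):
    process_files.delay(file_keys[i:i + BATCH_SIZE])   # one message per 100 files
```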

dalore