I am putting together a Celery-based data ingestion pipeline. One thing I don't see anywhere in the documentation is how to build a flow where workers only run when there is work to be done. (Seems like a major gap in the design of Celery, honestly.)

I understand Celery itself won't handle autoscaling of actual servers; that's fine. But when I simulate this, Flower doesn't see the work that was submitted unless the worker was online at the time the task was submitted. Why? I'd love a world where I'm not paying for servers unless there is actual work to be done.

Workflow:

  1. Imagine a while loop that's adding new data to be processed using the celery_app.send_task method (see the sketch after this list).

  2. I have custom code that sees there are N messages in the queue. It spins up a server and starts a Celery worker for that task.

  3. The Celery worker comes online and does the work.
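
For context, here is a minimal sketch of the producer side of that flow. The broker URL, queue name, task name, and threshold are placeholders I'm assuming for illustration, and the queue-depth check assumes a RabbitMQ-style broker where a passive queue declaration reports the message count:

    from celery import Celery

    # All names below (broker URL, queue, task name, threshold) are
    # illustrative placeholders, not taken from my actual setup.
    celery_app = Celery("ingest", broker="amqp://guest@localhost//")
    N_THRESHOLD = 10

    def queue_depth(queue_name):
        # On RabbitMQ, a passive declare reports the current message count
        # (it raises if the queue does not exist yet).
        with celery_app.connection_or_acquire() as conn:
            return conn.default_channel.queue_declare(
                queue=queue_name, passive=True
            ).message_count

    def start_server_and_worker():
        # Stand-in for the custom provisioning code from step 2.
        pass

    def produce(records):
        for record in records:
            # send_task publishes by name, so this process never has to
            # import the worker's task code.
            celery_app.send_task(
                "ingest.process_record", args=[record], queue="ingest"
            )
        if queue_depth("ingest") >= N_THRESHOLD:
            start_server_and_worker()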

BUT.

Flower has no record of that task, even though I can see the broker has a "message", and while watching the worker's output I can see it did its thing.

If I keep the worker online and then submit a task, it monitors everything just fine and dandy.

Anyone know why?

chasez0r

1 Answer


You can use Celery's autoscaling. For example, setting autoscale to 8 means the worker will fire up to 8 processes to work through your queue(s), though a master process will still sit waiting. You can also set a minimum, for example 2 to 8, which keeps 2 processes waiting but fires up more (up to 8) when needed, and then scales back down when the queue is empty.
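
For example, assuming your Celery app lives in a module named proj (a placeholder):

    celery -A proj worker --autoscale=8,2 --loglevel=INFO

The first number is the maximum pool size and the second is the minimum kept alive.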

This is the process-based autoscaler. You can use it as a reference if you want to create, say, a cloud-based autoscaler that fires up new nodes instead of just processes.
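
A rough sketch of that idea, with caveats: subclassing celery.worker.autoscale.Autoscaler and pointing the worker_autoscaler setting at it is the hook Celery provides, but the cloud call below is a placeholder you'd replace with your provider's API, and you should check the Autoscaler method names against your Celery version:

    from celery.worker.autoscale import Autoscaler

    def request_new_node(n):
        # Placeholder: call your cloud provider's API here to add capacity.
        pass

    class CloudAutoscaler(Autoscaler):
        def scale_up(self, n):
            # Grow the local pool as usual, but also ask the cloud
            # for more nodes when load spikes.
            request_new_node(n)
            return super().scale_up(n)

    # In your Celery config (module path is a placeholder):
    # worker_autoscaler = "myapp.scaling:CloudAutoscaler"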

As to your Flower issue, it's hard to say without knowing your broker (Redis/RabbitMQ/etc.). Flower doesn't capture everything: it relies on the broker carrying that information, and some configurations cause the broker to drop details like which tasks have run.
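
One thing worth checking (this is a guess about your setup, not something your question confirms): Flower only records what arrives on Celery's event stream, so make sure events are actually being emitted. These two settings are real Celery options; celery_app here is the app object from your question:

    celery_app.conf.update(
        # Have workers emit task events; equivalent to the worker's -E flag.
        worker_send_task_events=True,
        # Have clients emit a task-sent event when a task is published.
        task_send_sent_event=True,
    )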

dalore