How to make a celery worker stop receiving new tasks (Kubernetes)

Question

So we have a kubernetes cluster running some pods with celery workers. We are using python3.6 to run those workers and celery version is 3.1.2 (I know, really old, we are working on upgrading it). We have also setup some autoscaling mechanism to add more celery workers on the fly.

The problem is the following. So let's say we have 5 workers at any given time. Then lot of tasks come, increasing the CPU/RAM usage of the pods. That triggers an autoscaling event, adding, let's say, two more celery worker pods. So now those two new celery workers take some long running tasks. Before they finishing running those tasks, kubernetes creates a downscaling event, killing those two workers, and killing those long running tasks too.

Also, for legacy reasons, we do not have a retry mechanism if a task is not completed (and we cannot implement one right now).

So my question is, is there a way to tell kubernetes to wait for the celery worker to have run all of its pending tasks? I suppose the solution must include some way to notify the celery worker to make it stop receiving new tasks also. Right now I know that Kubernetes has some scripts to handle this kind of situations, but I do not know what to write on those scripts because I do not know how to make the celery worker stop receiving tasks.

Any idea?

score 4 · Accepted Answer · answered Aug 04 '22 at 11:37

I wrote a blog post exactly on that topic - check it out.

When Kubernetes decide to kill a pod, it first send SIGTERM signal, so your Application have time to gracefully shutdown, and after that if your Application didn't end - Kubernetes will kill it by sending a SIGKILL signal.

This period, between SIGTERM to SIGKILL can be tuned by terminationGracePeriodSeconds (more about it here).

In other words, if your longest task takes 5 minutes, make sure to set this value to something higher than 300 seconds.

Celery handle those signals for you as you can see here (I guess it is relevant for your version as well):

Shutdown should be accomplished using the TERM signal.

When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates. If these tasks are important, you should wait for it to finish before doing anything drastic, like sending the KILL signal.

As explained in the docs, you can set the acks_late=True configuration so the task will run again if it stopped accidentally.

Another thing that I didn't find documentation for (almost sure I saw it somewhere) - Celery worker won't receive a new tasks after getting a SIGTERM - so you should be safe to terminate the worker (might require to set worker_prefetch_multiplier = 1 as well).

Do you know if I can use a script to choose when to send the SIGKILL after the SIGTERM? We have too many celery tasks with way too different max time of completion, so I prefer sending it the SIGKILL using a script to check if the tasks on the worker are finished. — Antonio Gamiz Delgado, Aug 05 '22 at 05:24
Why not to set the maximum value that the longest task takes? If a shorter task finish before that, Celery worker should stop by itself (not wait for the SIGKILL). The Kill signal happens only when the application (worker in our case) ignore the first signal and doesn't take care of exiting by itself — ItayB, Aug 05 '22 at 07:33
Uoh, that’s fantastic news. I was not aware of that. Thanks a lot for the clarification and for the complete answer!!!!!! :) — Antonio Gamiz Delgado, Aug 05 '22 at 07:40

How to make a celery worker stop receiving new tasks (Kubernetes)

1 Answers1

Linked