
Is there a way to tell Kubernetes which pods to kill before or after a downscale? For example, suppose that I have 10 replicas and I want to downscale them to 5, but I want certain replicas to stay alive and others to be killed during the downscale. Is that possible?

Matheus Melo
  • Why would you want to do this? This may be possible, but it depends on what you're really trying to achieve, which might affect which solutions are viable for you. – Amit Kumar Gupta Sep 11 '19 at 00:38
  • @AmitKumarGupta I have an application that scales the pods based on some metrics, and the pods consume a queue. What I want is to kill only idle pods when the queue is empty. – Matheus Melo Sep 11 '19 at 00:41
  • Are all the pods reading from the same queue? And is your concern that k8s will kill off a pod that's currently working on a task, and doing so mid-flight will leave it in an inconsistent state? If so, you're probably trying to solve the wrong problem. The real problem is that you aren't gracefully handling task failures. – Grant David Bachman Sep 11 '19 at 01:09
  • @GrantDavidBachman In that case, in my solution, the task goes to the end of the queue to be processed again by another pod. The problem is when I have a few big tasks: when the queue is empty, the controller will downscale the pods, killing idle and working pods indiscriminately and making another pod reprocess a task even if it was 99% complete. – Matheus Melo Sep 11 '19 at 01:16
  • Possible duplicate of [Kubernetes scale down specific pods](https://stackoverflow.com/questions/33617090/kubernetes-scale-down-specific-pods) – Matt Sep 11 '19 at 03:26

3 Answers


While it's not possible to choose which specific pod is killed, you can prevent what you're really concerned about: the killing of pods that are in the middle of processing tasks. This requires two things:

  1. Your application should listen for and handle SIGTERM, which Kubernetes sends to pods before it kills them. In your case, your app would handle SIGTERM by finishing any in-flight tasks and then exiting.
  2. Set terminationGracePeriodSeconds on the pod to something greater than the time it takes to process your longest task (see the sketch after this list). Setting this property extends the period between Kubernetes sending SIGTERM (asking your application to finish up) and SIGKILL (forcefully terminating it).
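
A minimal sketch of the second point, assuming your workers run in a Deployment; the name queue-worker, the image, and the 600-second grace period are placeholders you'd adjust to your own workload and longest task:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-worker                  # hypothetical name
spec:
  replicas: 10
  selector:
    matchLabels:
      app: queue-worker
  template:
    metadata:
      labels:
        app: queue-worker
    spec:
      # After SIGTERM, Kubernetes waits this long before sending SIGKILL,
      # giving the worker time to finish its in-flight task and exit.
      terminationGracePeriodSeconds: 600
      containers:
      - name: worker
        image: example.org/queue-worker:latest   # placeholder image
```

With this in place, a scale-down still picks pods arbitrarily, but a pod that is mid-task gets up to the grace period to drain before it is forcefully killed.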
Grant David Bachman

As per the link provided by @Matt and @Robert Bailey's answer, Kubernetes ReplicaSet-based resources currently don't support scaling down by removing specific Pods from the replica pool. You can follow the related issue #45509 and the follow-up PR #75763.

Nick_Kh

You can use StatefulSets instead of ReplicaSets: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/. Their Pods are created sequentially (my-app-0, my-app-1, my-app-2), and when you scale down, they are terminated in reverse order, from {N-1..0}.
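
A minimal sketch, assuming an app called my-app; the name, image, and headless Service name are placeholders:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                        # hypothetical name
spec:
  serviceName: my-app                 # assumes a matching headless Service exists
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: worker
        image: example.org/my-app:latest   # placeholder image
```

Scaling this down to 5 replicas removes my-app-9 through my-app-5, highest ordinal first, so the lowest-numbered Pods are the ones that survive.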

iliefa