I have a stateful application deployed in a Kubernetes cluster. The challenge is how to scale the cluster down gracefully, so that each pod being terminated during the scale-down completes its pending tasks and then shuts down gracefully. The scenario is similar to the one explained below, but in my case the terminating pods will have a few in-flight tasks still to be processed.

https://medium.com/@marko.luksa/graceful-scaledown-of-stateful-apps-in-kubernetes-2205fc556ba9

Is there official feature support for this in the Kubernetes API?

Kubernetes version: v1.11.0

Host OS: linux/amd64

CRI version: Docker 1.13.1

UPDATE:

Possible solution - while performing a StatefulSet scale-down, the preStop hook of the terminating pod(s) sends a message to a queue with the metadata of the respective task(s) still to be completed. A Kubernetes Job then picks up those messages and completes the tasks. Please comment on whether this is a recommended approach from a Kubernetes perspective.
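
A rough sketch of what the preStop side of this could look like. The queue endpoint variable, the metadata file path, and the availability of curl in the image are assumptions for illustration, not an official Kubernetes feature:

spec:                                    # pod template spec inside the StatefulSet
  terminationGracePeriodSeconds: 120     # must outlast the hook, otherwise SIGKILL cuts it short
  containers:
    - name: worker
      image: my-worker:latest            # placeholder image
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              # publish the metadata of in-flight tasks so a separate Kubernetes Job
              # can pick them up and finish them after the pod is gone
              - >-
                curl -sf -X POST "$TASKS_QUEUE_URL"
                --data @/var/run/app/pending-tasks.json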

Thanks In Advance!

Regards, Balu

Balu R
  • This is discussed at least briefly in the [StatefulSet documentation](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#deployment-and-scaling-guarantees): `pod-n` won't start terminating until `pod-n+1` is completely stopped. What do you have so far, and what isn't working? – David Maze Jul 07 '20 at 10:25
  • Dear David, that point is understood; my specific question is how pods can complete their pending tasks while terminating. E.g. with a StatefulSet of 3 replicas, when I scale down to 2 replicas, how can pod-3, while terminating, ensure that all the pending tasks (tagged to pod-3) are finished? – Balu R Jul 07 '20 at 14:37
  • [Termination of Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods) is the other key documentation link. Your service gets SIGTERM and has (by default) 30 seconds to finish up; finish your pending tasks and end the process, or get SIGKILL and forcible termination. – David Maze Jul 07 '20 at 14:48
  • I tried the preStop hook, but predicting terminationGracePeriodSeconds is a challenge, as I don't know the exact time required to complete the pending tasks (it can vary from time to time). – Balu R Jul 08 '20 at 15:23

1 Answer

Your pod will be scaled down only after the in-progress job is completed. You may additionally configure the lifecycle section in the deployment manifest with a preStop hook, which will gracefully stop your application. This is one of the best practices to follow. Please refer to the [container lifecycle hooks documentation](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/) for a detailed explanation and the syntax.
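
For reference, a minimal sketch of the lifecycle/preStop syntax being described; the stop-and-drain script path is a hypothetical stand-in for your application's own shutdown command:

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "/app/bin/stop-and-drain.sh"]   # hypothetical drain script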

Updated Answer

This is the YAML I deployed locally; I then generated load to raise the CPU utilization and trigger the HPA.

Deployment.yaml

kind: Deployment
apiVersion: apps/v1
metadata:
  namespace: default
  name: whoami
  labels:
    app: whoami
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
        - name: whoami
          image: containous/whoami
          resources:
            requests:
              cpu: 30m
            limits:
              cpu: 40m
          ports:
            - name: web
              containerPort: 80
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - echo "Starting Sleep"; date; sleep 600; echo "Pod will be terminated now"
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: whoami
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: whoami
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 40
#    - type: Resource
#      resource:
#        name: memory
#        targetAverageUtilization: 10
---
apiVersion: v1
kind: Service
metadata:
  name: whoami-service
spec:
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
  selector:
    app: whoami

Once the pod is deployed, execute the commands below to generate the load (the second command runs inside the busybox shell that the first one opens).

kubectl run -i --tty load-generator --image=busybox /bin/sh

while true; do wget -q -O- http://whoami-service.default.svc.cluster.local; done

Once the additional replicas had been created, I stopped the load, and the pods were terminated after 600 seconds. This scenario worked for me. I believe the same would apply to a StatefulSet as well; a sketch follows. Hope this helps.
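
An untested sketch of the StatefulSet equivalent, reusing the whoami image from above and adding terminationGracePeriodSeconds so the grace period outlasts the 600-second preStop sleep:

kind: StatefulSet
apiVersion: apps/v1
metadata:
  namespace: default
  name: whoami
  labels:
    app: whoami
spec:
  serviceName: whoami-service   # placeholder; a StatefulSet normally uses a headless Service here
  replicas: 3
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      terminationGracePeriodSeconds: 700   # longer than the 600 s sleep in the preStop hook
      containers:
        - name: whoami
          image: containous/whoami
          ports:
            - name: web
              containerPort: 80
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - echo "Starting Sleep"; date; sleep 600; echo "Pod will be terminated now"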

Vamshi Siddarth
  • Dear Vamshi, my scenario is: with a StatefulSet of 3 replicas, when I scale down to 2 replicas, how can pod-3, while terminating, ensure that all the pending tasks (tagged to pod-3) are finished? I tried the preStop hook, but predicting terminationGracePeriodSeconds is a challenge, as I don't know the exact time required to complete the pending tasks. – Balu R Jul 07 '20 at 14:40
  • We can reach a conclusion by creating a sandbox on a local machine where the pods are assigned a task, then scaling down manually and verifying whether the job completes gracefully. Example: run a shell script that sleeps for 300 seconds and then writes an echo message to a file. Let me know if you need any details. – Vamshi Siddarth Jul 07 '20 at 16:50
  • Also, if the preStop hook is configured, we need to provide the service stop command there. That gives your application the chance to gracefully finish its jobs across the application cluster and then stop its services. – Vamshi Siddarth Jul 07 '20 at 16:59
  • So are we saying that if the preStop hook is configured with the application/service stop command, then SIGKILL (forced termination) will not take precedence, even after terminationGracePeriodSeconds is over? – Balu R Jul 08 '20 at 04:32
  • Yes, that is the understanding. Of course, to be absolutely sure we definitely need to try the test scenario. – Vamshi Siddarth Jul 08 '20 at 06:16
  • Thank you. Will test and evaluate. – Balu R Jul 08 '20 at 06:29
  • Hi Vamshi, I did a test run and see that the preStop hook takes no precedence over terminationGracePeriodSeconds. My service stop slept for 5 minutes while terminationGracePeriodSeconds was configured as 2 minutes; the pod was terminated after 2 minutes without waiting for the preStop hook to complete. The cluster sends SIGKILL once terminationGracePeriodSeconds is over. – Balu R Jul 08 '20 at 15:20
  • I see. How did you try scaling down the pod, using the kubectl command or the HPA? Even I'm a bit confused by [this](https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#hook-delivery-guarantees). – Vamshi Siddarth Jul 08 '20 at 17:08
  • kubectl was used for scaling down. "Kubernetes sends the preStop event immediately before the Container is terminated. Kubernetes’ management of the Container blocks until the preStop handler completes, unless the Pod’s grace period expires" https://stackoverflow.com/questions/61074948/with-kubernetes-is-there-a-way-to-wait-for-a-pod-to-finish-its-ongoing-tasks-bef – Balu R Jul 09 '20 at 04:11
  • Yeah, that makes sense. So this seems to answer your question, right? – Vamshi Siddarth Jul 09 '20 at 05:52
  • In fact it doesn't answer my original question/post: how, in StatefulSets (during scale-down), can the pending tasks (tagged to the pod) be completed while the pod is terminating? Is there a standard approach for this scenario? preStop was tried as an option, but it looks like it won't be a fit for the reason above. – Balu R Jul 09 '20 at 06:07
  • Why would you want to scale down manually? I wanted to understand that part of the use case. Usually we look for this kind of configuration when using the HPA, so that an automatic scale-down doesn't terminate a pod with an incomplete task. I will try to put together a deployment file based on my understanding and share it with you today; we can check it and come to a conclusion. – Vamshi Siddarth Jul 09 '20 at 06:43
  • OK, thanks. Is there separate behavior when scaling down StatefulSets via the HPA versus manually (using the kubectl CLI)? I believe the behavior of the preStop hook and terminationGracePeriodSeconds is the same for both HPA and kubectl scale-downs, right? – Balu R Jul 09 '20 at 07:13
  • Updated the answer with the use case. Please check it once and let me know if it helps. – Vamshi Siddarth Jul 10 '20 at 16:13
  • Hi Vamshi, I ran with the HPA and a deployment where the preStop hook was implemented with an increased wait of 1200 seconds. Once the scale-down event was triggered, the pods terminated well before 1200 seconds. Meanwhile, the HPA waits between events to avoid thrashing; hope that wasn't showing up as the wait before pod termination in your case (just a guess). As I still see from the documentation, pods can only wait until terminationGracePeriodSeconds expires; after that, SIGKILL is sent even if the preStop hook is still executing. – Balu R Jul 17 '20 at 13:39
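
The takeaway from the thread above is that the preStop hook only runs within the pod's grace period, so it can only cover long-running drain work if terminationGracePeriodSeconds is raised to at least the hook's worst-case duration. A minimal sketch of that relationship, using the 1200-second wait from the test above (the image name and exact numbers are illustrative assumptions):

spec:
  # once this expires the kubelet sends SIGKILL, regardless of whether the
  # preStop hook has finished, so it must cover the hook's worst case
  terminationGracePeriodSeconds: 1300
  containers:
    - name: app
      image: my-app:latest                             # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 1200"]   # placeholder for the real drain logic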