We run a Kubernetes cluster provisioned with Kubespray and discovered that each time a faulty node goes down (we recently had this happen due to a hardware issue), the pods running on that node get stuck in the Terminating state indefinitely. Even after many hours the pods are not redeployed on healthy nodes, so our entire application malfunctions and users are affected for a prolonged period of time.

How can we configure Kubernetes to perform failover in situations like this?

Below is our StatefulSet manifest.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: project-stock
  name: ps-ra
spec:
  selector:
    matchLabels:
      infrastructure: ps
      application: report-api
      environment: staging
  serviceName: hl-ps-report-api
  replicas: 1
  template:
    metadata:
      namespace: project-stock
      labels:
        infrastructure: ps
        application: report-api
        environment: staging
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: ps-report-api
          image: localhost:5000/ps/nodejs-chrome-application:latest
          ports:
            - containerPort: 3000
              protocol: TCP
              name: nodejs-rest-api
          resources:
            limits:
              cpu: 1000m
              memory: 8192Mi
            requests:
              cpu: 333m
              memory: 8192Mi
          livenessProbe:
            httpGet:
              path: /health/
              port: 3000
            initialDelaySeconds: 180
            periodSeconds: 10
            failureThreshold: 12
            timeoutSeconds: 10
roman
  • Did you use a Deployment when creating the pods? Also, as a recommendation, it's always good to share the YAML so others can help. – Baguma Aug 30 '21 at 07:43
  • You can check this link https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/ to read more about eviction policies, which I guess may help with your use case. – Chandra Sekar Aug 30 '21 at 07:51
  • The above link does not seem to be helpful in my case, as it does not answer why the pods are stuck in the Terminating state. As far as I understand, the pods must be restarted on healthy nodes within a short time, by Kubernetes design. – roman Aug 30 '21 at 10:51
  • Hello @roman, I have dealt with a similar situation, and whenever a pod is stuck in a Terminating state, it has to do with some resources that cannot be deleted. It might be volumes or other dependent resources, since you are using a StatefulSet. Please check the logs of every resource the pods use; that might surface some useful information. Let me know if you find anything interesting. BTW, I came to this link from Upwork. – YYashwanth Aug 31 '21 at 05:18
  • In this case the pod does not use any extra resources beyond those described in the manifest above. The pod is stuck in the Terminating state as long as the broken worker node is down. As soon as it comes back online, the pods are successfully restarted on the same or other nodes. – roman Aug 31 '21 at 05:24
  • Also, it was working correctly in case of failures when the cluster was provisioned with kops on AWS. Once we switched to Kubespray we got the above issue. – roman Aug 31 '21 at 05:28
  • Hi @roman, which Kubernetes version did you use with kops on AWS, and which version are you using with Kubespray? How exactly did you configure Kubespray? As I understand it, you are using some bare-metal solution? – Mikolaj S. Aug 31 '21 at 09:35
  • @mikolaj-s, we were using 1.18.6 on kops and are currently using 1.18.10 on the Kubespray cluster. The cluster was provisioned using the default configuration from the Kubespray branch named release-2.14. The new cluster is a bare-metal cluster, correct. – roman Aug 31 '21 at 12:49

1 Answer

Posted as a community wiki for better visibility. Feel free to expand it.


In my opinion, the behaviour on your Kubespray cluster (pods staying in the Terminating state) is fully intentional. Based on the Kubernetes documentation:

A Pod is not deleted automatically when a node is unreachable. The Pods running on an unreachable Node enter the 'Terminating' or 'Unknown' state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node.

The same documentation describes the ways in which a Pod in the Terminating state can be removed, along with some recommended best practices:

The only ways in which a Pod in such a state can be removed from the apiserver are as follows:

  • The Node object is deleted (either by you, or by the Node Controller).
  • The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
  • Force deletion of the Pod by the user.

The recommended best practice is to use the first or second approach. If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object. If the Node is suffering from a network partition, then try to resolve this or wait for it to resolve. When the partition heals, the kubelet will complete the deletion of the Pod and free up its name in the apiserver. Normally, the system completes the deletion once the Pod is no longer running on a Node, or the Node is deleted by an administrator. You may override this by force deleting the Pod.
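As a first step in a situation like this, it can help to confirm from the apiserver's point of view that the node is unreachable and that the Pod is stuck. A minimal check, assuming the project-stock namespace from your manifest (the node name in the output is whatever your failed node is called), could look like this:

# Check whether the failed node is reported as NotReady
kubectl get nodes

# Show which node the stuck Pod was scheduled on and its current state
kubectl get pods -n project-stock -o wide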

You can implement Graceful Node Shutdown if your node is shut down in one of the following ways:

On Linux, your system can shut down in many different situations. For example:

  • A user or script running shutdown -h now or systemctl poweroff or systemctl reboot.
  • Physically pressing a power button on the machine.
  • Stopping a VM instance on a cloud provider, e.g. gcloud compute instances stop on GCP.
  • A Preemptible VM or Spot Instance that your cloud provider can terminate unexpectedly, but with a brief warning.

Keep in mind this feature is supported from version 1.20 and up, where it is in alpha state (as of 1.21 it is in beta state).
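As a rough sketch, enabling it on a supported version means setting the feature gate and the shutdown grace periods in the kubelet configuration (in your case this would first require upgrading the 1.18.10 cluster). The values below are illustrative only, not recommendations:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Required on 1.20 where the feature is alpha; enabled by default from 1.21
  GracefulNodeShutdown: true
# Total time the node delays shutdown so pods can terminate gracefully
shutdownGracePeriod: 30s
# Portion of shutdownGracePeriod reserved for critical pods
shutdownGracePeriodCriticalPods: 10s

Note that this only covers the orderly shutdown scenarios listed above; it will not help with a sudden hardware failure where the node never gets a chance to shut down cleanly.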

The other solution, mentioned in the documentation, is to manually delete the node, for example by running kubectl delete node <your-node-name>:

If a Node is confirmed to be dead (e.g. permanently disconnected from the network, powered down, etc), then delete the Node object.

The pod will then be re-scheduled on another node.
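A minimal sequence for this approach, assuming the failed node is named <your-node-name> and using the namespace from the manifest above:

# Remove the dead node's object from the apiserver
kubectl delete node <your-node-name>

# Watch the stuck Pod get removed and recreated by the StatefulSet controller on a healthy node
kubectl get pods -n project-stock -w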

The last workaround is to set pod.Spec.TerminationGracePeriodSeconds to 0, but this is strongly discouraged:

For the above to lead to graceful termination, the Pod must not specify a pod.Spec.TerminationGracePeriodSeconds of 0. The practice of setting a pod.Spec.TerminationGracePeriodSeconds of 0 seconds is unsafe and strongly discouraged for StatefulSet Pods. Graceful deletion is safe and will ensure that the Pod shuts down gracefully before the kubelet deletes the name from the apiserver.
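For completeness, the force deletion of the Pod mentioned in the documentation excerpt earlier would look roughly like the command below, assuming the single replica is named ps-ra-0 per the usual StatefulSet naming. Given the caveats quoted above, treat it strictly as a last resort when the node is known to be permanently gone:

# Force-remove the stuck Pod from the apiserver so the StatefulSet controller can recreate it;
# unsafe if the old node might still be running the container
kubectl delete pod ps-ra-0 -n project-stock --grace-period=0 --force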

Mikolaj S.