
Is it possible for a pod/deployment/statefulset to be moved to another node, or recreated on another node automatically, if the first node fails? The pod in question is set to 1 replica. In other words, is it possible to configure some sort of failover for Kubernetes pods? I've tried out pod affinity settings, but nothing has been moved automatically; it has been around 10 minutes.

The YAML for the pod in question is below:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-rbd-sc-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: ceph-rbd-sc
---
apiVersion: v1
kind: Pod
metadata:
  name: ceph-rbd-pod-pvc-sc
  labels:
    app: ceph-rbd-pod-pvc-sc
spec:
  containers:
  - name:  ceph-rbd-pod-pvc-sc
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /mnt/ceph_rbd
      name: volume
  nodeSelector:
    etiket: worker
  volumes:
  - name: volume
    persistentVolumeClaim:
      claimName: ceph-rbd-sc-pvc
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            name: ceph-rbd-pod-pvc-sc
        topologyKey: "kubernetes.io/hostname"

Edit:

I managed to get it to work. But now I have another problem: the newly created pod on the other node is stuck in ContainerCreating and the old pod is stuck in Terminating. I also get a Multi-Attach error for the volume, stating that the PV is still in use by the old pod. The situation is the same for any deployment/statefulset with a PV attached; the problem is only resolved when the failed node comes back online. Is there a solution for this?

Nyquillus
  • This is exactly what Kubernetes does. It really depends on what "the node fails" means to you. If you explain a little better what's happening in your cluster and your environment, maybe you can get a meaningful answer – whites11 Jul 08 '21 at 07:14
  • I'm testing with a 3-node (1 master, 2 worker) cluster with low specs, about 4 cores and 8 GB of RAM. All the nodes are virtual machines. I'm trying to get a grip on things in small environments before getting into bare-metal clusters. By "node fail" I mean, for example, the machine shutting down or network connectivity dropping. As I mentioned above, I couldn't get the pod to move automatically after a force shutdown of the first node. – Nyquillus Jul 08 '21 at 07:22
  • Is the first node the master node? Then this is working as expected, as the scheduler and kube-apiserver pods run on master nodes and not on the worker nodes. So if the master fails, no new pods can or will get scheduled on worker nodes, except static pods, located by default under /etc/kubernetes/manifests. However, those pods won't get scheduled on other nodes anyway – meaningqo Jul 08 '21 at 07:24
  • The first node is the master node, yes, but I'm testing force shutdowns on the first worker node, which is node 2 in short. The pod I posted above is a single replica running on the first worker node, and I want it to move to the second worker node automatically. – Nyquillus Jul 08 '21 at 07:28
  • Then Kubernetes should handle it by default, as the pod is handled by a statefulset. The statefulset controller should see that the desired replica count is not equal to the actual replica count and therefore schedule another pod on the second worker node. What is the output of "kubectl get nodes" after you have forcefully shut down the first worker node? – meaningqo Jul 08 '21 at 07:30
  • The node shows as not ready after the shutdown. Nothing is done to move the pod. – Nyquillus Jul 08 '21 at 08:08
  • Oh, I just realised that it is a simple pod. Simple pods won't get rescheduled; only pods that are handled by a deployment/replicaset/statefulset/replicationcontroller will automatically be scheduled on another node. See @coderanger's answer – meaningqo Jul 08 '21 at 08:46

2 Answers


The answer from coderanger remains valid regarding Pods. Answering your last edit:

Your issue is with CSI.

  • When your Pod uses a PersistentVolume whose accessModes is RWO (ReadWriteOnce),

  • and when the Node hosting your Pod becomes unreachable, prompting Kubernetes to terminate the current Pod and create a new one on another Node,

your PersistentVolume cannot be attached to the new Node.

The reason for this is that CSI introduced some kind of "lease", marking a volume as bound.

With previous CSI specs & implementations, this lock was not visible in the Kubernetes API. If your ceph-csi deployment is recent enough, you should find a corresponding "VolumeAttachment" object that can be deleted to fix your issue:

# kubectl get volumeattachments -n ci
NAME                                                                   ATTACHER           PV                                         NODE                ATTACHED   AGE
csi-194d3cfefe24d5f22616fabd3d2fb2ce5f79b16bdca75088476c2902e7751794   rbd.csi.ceph.com   pvc-902c3925-11e2-4f7f-aac0-59b1edc5acf4   melpomene.xxx.com   true       14d
csi-24847171efa99218448afac58918b6e0bb7b111d4d4497166ff2c4e37f18f047   rbd.csi.ceph.com   pvc-b37722f7-0176-412f-b6dc-08900e4b210d   clio.xxx.com        true       90d
....
kubectl delete -n ci volumeattachment csi-xxxyyyzzz

Those VolumeAttachments are created by your CSI provisioner, before the device mapper attaches a volume.

They are deleted only once the corresponding PV has been released from a given Node, according to its device mapper - which needs to be running, with kubelet up and the Node marked Ready according to the API. Until then, other Nodes can't map it. There is no timeout: should a Node become unreachable due to network issues or an abrupt shutdown/force-off/reset, its RWO PVs are stuck.

See: https://github.com/ceph/ceph-csi/issues/740
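
If you need to recover manually while the Node is still down, the cleanup looks roughly like the sketch below. The node name "worker-1" and the attachment name are placeholders, and the pod name is the one from your manifest; only do this once you are sure the node really is down, since forcing things while the volume is still mapped can corrupt data.

# VolumeAttachments expose spec.nodeName, so you can list the ones still bound to the dead node:
kubectl get volumeattachments \
  -o jsonpath='{range .items[?(@.spec.nodeName=="worker-1")]}{.metadata.name}{" -> "}{.spec.source.persistentVolumeName}{"\n"}{end}'

# Force-remove the Pod stuck in Terminating on the unreachable node:
kubectl delete pod ceph-rbd-pod-pvc-sc --grace-period=0 --force

# Then delete the stale attachment so the replacement Pod can attach the volume:
kubectl delete volumeattachment csi-xxxyyyzzz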

One workaround for this would be not to use CSI, and rather stick with legacy StorageClasses - in your case, installing rbd on your nodes.
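
For illustration, a minimal sketch of such a legacy, in-tree RBD StorageClass using the kubernetes.io/rbd provisioner; the monitor address, pool and Secret names below are placeholders to adapt to your cluster:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-rbd-legacy
provisioner: kubernetes.io/rbd
parameters:
  # Placeholder values - point these at your own Ceph monitors, pool and Secrets
  monitors: 10.0.0.1:6789
  pool: kube
  adminId: admin
  adminSecretName: ceph-admin-secret
  adminSecretNamespace: kube-system
  userId: kube
  userSecretName: ceph-user-secret
  fsType: ext4
  imageFormat: "2"
  imageFeatures: layering

Note that provisioning through such a class needs the rbd binary wherever kube-controller-manager runs, and mapping needs it on every node.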

Though last I checked -- k8s 1.19.x -- I couldn't manage to get that working; I can't recall what was wrong. CSI tends to be "the way" to do it nowadays. Sadly, it is not really suitable for production use unless you are running in an IaaS with auto-scale groups deleting dead Nodes from the Kubernetes API (eventually evicting the corresponding VolumeAttachments), or using some kind of MachineHealthCheck like OpenShift 4 implements.

SYN
  • Exactly. Until CSI can unbind the PV from the dead worker node, the pod will never start, and when you describe it, it will say that it is waiting for the volume to be attached to the worker node it got recreated on. – DevLounge Jul 09 '21 at 22:33
  • What you mean by "installing rbd" is creating the block device manually through Ceph and mounting it to the nodes, right? – Nyquillus Jul 12 '21 at 07:23
  • Installing RBD binaries on your nodes would allow you to use the legacy/non-CSI Ceph StorageClass, attaching volumes to your nodes without that locking mechanism. Though I'm not sure provisioning volumes would still work with a recent kube-controller-manager (at least if it runs in a container, I think the rbd binaries would be missing) – SYN Jul 12 '21 at 11:05

A bare Pod is a single immutable object. It doesn't have any of these nice things. Related: never ever use bare Pods for anything. If you try this with a Deployment, you should see it spawn a new Pod to get back to the requested number of replicas. If the new Pod is unschedulable, you should see events emitted explaining why: for example, if only node 1 matches the nodeSelector you specified, or if another Pod is already running on the other node, which triggers the anti-affinity.
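
To make that concrete, here is a minimal sketch of the same workload wrapped in a Deployment, reusing the image, PVC and nodeSelector from the question (the Deployment name is arbitrary):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ceph-rbd-deploy-pvc-sc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ceph-rbd-pod-pvc-sc
  template:
    metadata:
      labels:
        app: ceph-rbd-pod-pvc-sc
    spec:
      containers:
      - name: ceph-rbd-pod-pvc-sc
        image: busybox
        command: ["sleep", "infinity"]
        volumeMounts:
        - mountPath: /mnt/ceph_rbd
          name: volume
      nodeSelector:
        etiket: worker
      volumes:
      - name: volume
        persistentVolumeClaim:
          claimName: ceph-rbd-sc-pvc

Keep replicas at 1 here: the PVC is ReadWriteOnce, so the volume can only be attached to one node at a time.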

coderanger
  • I'll try something asap and give feedback, thanks. – Nyquillus Jul 08 '21 at 08:55
  • Noticed something: I added Kubeapps a while ago using its Helm chart (bitnami/kubeapps), and when I shut down a machine, all the pods created by it are redeployed on the other nodes (including the master node, not sure why), even though I didn't specify any affinity parameters in the chart. I also have a simple nginx pod managed by a Deployment set to one replica. That pod did not get recreated on another node; it's stuck in Terminating, and deleting the pod doesn't help either. – Nyquillus Jul 08 '21 at 10:54