
In an environment with more than one node, using Ceph block volumes in RWO mode, if a node fails (it is unreachable and will not come back soon) and the pod is rescheduled to another node, the pod can't start if it has a Ceph block PVC. The reason is that the volume is still considered to be 'in use' by the old pod: because the node failed, its resources can't be removed properly.

If I remove the node from the cluster using `kubectl delete node dead-node`, the pod can start because the resources get removed.
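
For context, what deleting the node unblocks is, roughly speaking, the stale attachment record for that node. Assuming the Ceph volumes are provisioned through a CSI driver (e.g. ceph-csi), a similar effect can be achieved by hand against the VolumeAttachment objects; this is only a sketch, and the object names should be double-checked in your own cluster before deleting anything:

# List CSI attachments and spot the ones still bound to the dead node
kubectl get volumeattachments \
  -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName

# Delete the stale attachment so the volume can be attached on the new node
# (<attachment-name> is a placeholder for whatever the listing above shows)
kubectl delete volumeattachment <attachment-name>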

How can I do this automatically? Some possibilities I have thought about are:

  • Can I set a force detach timeout for the volume?
  • Set a delete node timeout?
  • Automatically delete a node with given taints?

I could use ReadWriteMany mode with other volume types so the PV can be used by more than one pod, but that is not ideal.
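
Regarding the last idea (automatically deleting a node with given taints), a very rough sketch of what such automation could look like is below. It keys on the node.kubernetes.io/unreachable taint that the node lifecycle controller applies, assumes jq is available, and leaves out the grace period and safety checks a real controller would need:

#!/bin/sh
# Hypothetical sketch: delete every node currently carrying the
# "unreachable" taint. A real setup would wait for a configurable timeout
# and exclude nodes that are expected to come back.
for node in $(kubectl get nodes -o json \
    | jq -r '.items[]
             | select(any(.spec.taints[]?; .key == "node.kubernetes.io/unreachable"))
             | .metadata.name'); do
  echo "Deleting unreachable node: $node"
  kubectl delete node "$node"
done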

David Medinets

1 Answer


You can probably add a sidecar container and tweak the readiness and liveness probes in your pod so that the pod doesn't restart if the Ceph block volume is unreachable for some time from the container that is using it. (There may be other implications for your application, though.)

Something like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: ceph
  name: ceph-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
  - name: cephclient
    image: ceph
    volumeMounts:
    - name: ceph
      mountPath: /cephmountpoint
    livenessProbe:
      # ... something
      initialDelaySeconds: 5
      periodSeconds: 3600   # make this really long
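
As a purely hypothetical example of the kind of command that second probe could exec (it depends on your image; this assumes coreutils' timeout and stat are available in the ceph image, and that a hung Ceph mount would make the call block):

# Hypothetical exec-probe command for the cephclient container: exits
# non-zero (failing the probe) if the Ceph mountpoint does not respond
# within 5 seconds.
timeout 5 stat /cephmountpoint > /dev/null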

✌️☮️

Rico
    There is a chance you misunderstood the problem: OP has a problem with a volume being "attached" to an unhealthy node, which won't be detached without manual intervention. – zerkms Jul 29 '20 at 05:56
  • Thanks. Yes, I was thinking of preventing that problem in the first place – Rico Jul 29 '20 at 11:45
  • I don't think you can prevent that: say a node lost its power, or kernel panicked, or cpu has burned, or kubelet has crashed, or anything else has happened. There are millions of ways to get a node unhealthy. – zerkms Jul 29 '20 at 23:59
  • yeah, but doesn't Ceph provide that redundancy? meaning if one of its nodes goes down? Isn't it a 'virtual volume'? My thought about preventing it was not about preventing a ceph node from going down, but more about preventing the pod from restarting if a ceph volume becomes unresponsive for some time and then recovers later, based on ceph redundancy. I may be misunderstanding how ceph works. – Rico Jul 30 '20 at 00:09
  • I believe the question is about _clients_ not ceph servers. "The reason is that the volume is 'still being used' by the other pod (because as the node failed, its resources can't be removed properly)." So the client node goes away, kubernetes still sees the volume as attached, and won't resolve unless you interfere manually. – zerkms Jul 30 '20 at 00:21
  • Yeah, so my answer is a take on what happens if you never remove that client and just let ceph recover. Then there wouldn't be a case where you have to care about ceph clients holding volumes when pods crash. That's what I mean by preventing the problem in the first place. – Rico Jul 30 '20 at 02:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/218859/discussion-between-zerkms-and-rico). – zerkms Jul 30 '20 at 02:04