In a cluster with more than one node, using Ceph block volumes in RWO mode: if a node fails (it is unreachable and will not come back soon) and a pod is rescheduled to another node, the pod can't start if it has a Ceph block PVC. The reason is that the volume is still reported as being used by the old pod, because the node failed and its resources can't be cleaned up properly.
If I remove the node from the cluster with `kubectl delete node dead-node`, the pod can start, because the stale resources get removed along with the node.
How can I do this automatically? Some possibilities I have thought about are:
- Can I set a force detach timeout for the volume?
- Set a timeout after which a dead node is deleted?
- Automatically delete nodes that carry given taints? (A rough sketch of this idea is below.)
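For the last option, this is roughly what I have in mind: a periodic script (run from cron or a CronJob) that deletes nodes whose Ready condition has not been True for longer than some timeout. The 600-second timeout and the reliance on the Ready condition are just my assumptions, not something I have running:

```bash
#!/usr/bin/env bash
# Sketch only: delete nodes that have not been Ready for longer than TIMEOUT_SECONDS.
# Deleting a node is destructive, so this only makes sense for nodes that
# will not come back (e.g. instances that are already gone).
TIMEOUT_SECONDS=600   # assumption: 10 minutes before giving up on the node

now=$(date +%s)

# Print "<name> <Ready status> <lastTransitionTime>" for every node.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name} {.status.conditions[?(@.type=="Ready")].status} {.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' |
while read -r node ready since; do
  # Healthy nodes report Ready=True; unreachable nodes report Unknown.
  [ "$ready" = "True" ] && continue

  since_epoch=$(date -d "$since" +%s)   # GNU date assumed
  if [ $((now - since_epoch)) -gt "$TIMEOUT_SECONDS" ]; then
    echo "Node $node has been NotReady for more than ${TIMEOUT_SECONDS}s, deleting it"
    kubectl delete node "$node"
  fi
done
```

I would prefer a built-in mechanism over maintaining a script like this, if one exists.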
I could use the `ReadWriteMany` access mode with other volume types, so that the PV can be used by more than one pod, but that is not ideal.
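For reference, this is the kind of ReadWriteMany claim I mean; the `cephfs` storage class name and the size are just placeholders for whatever RWX-capable backend is available:

```bash
# Placeholder example of an RWX claim (storage class name and size are made up).
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: cephfs
  resources:
    requests:
      storage: 10Gi
EOF
```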