
Consider the PersistentVolumeClaim below, as well as the Deployment using it.

Being ReadWriteOnce, the PVC can only be mounted by one node at a time. As there will only ever be one replica of my deployment, I figured this should be fine. However, upon restarts/redeploys, two Pods will briefly co-exist during the switchover.

If Kubernetes decides to start the successor pod on the same node as the original pod, they will both be able to access the volume and the switchover goes fine. But - if it decides to start it on a new node, which it seems to prefer, my deployment ends up in a deadlock:

Multi-Attach error for volume "pvc-c474dfa2-9531-4168-8195-6d0a08f5df34" Volume is already used by pod(s) test-cache-5bb9b5d568-d9pmd

The successor pod can't start because the volume is mounted on another node, while the original pod/node, of course, won't let go of the volume until the pod is taken out of service. Which it won't be until the successor is up.

What am I missing here?


apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol-test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: do-block-storage
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-cache
spec:
  selector:
    matchLabels:
      app: test-cache-deployment
  replicas: 1
  template:
    metadata:
      labels:
        app: test-cache-deployment
    spec:
      containers:
      - name: test-cache
        image: myrepo/test-cache:1.0-SNAPSHOT
        volumeMounts:
          - mountPath: "/test"
            name: vol-mount
        ports:
        - containerPort: 8080
        imagePullPolicy: Always
      volumes:
        - name: vol-mount
          persistentVolumeClaim:
            claimName: vol-test
      imagePullSecrets:
      - name: regcred
John-Arne Boge
  • As @acid_fuji points out, this is the expected behavior of ReadWriteOnce. The trouble is that I am stuck with this, as it's the only access mode available at DigitalOcean Kubernetes. Is NodeAffinity my only option, or is there a better way to use ReadWriteOnce volumes? – John-Arne Boge Jun 05 '20 at 15:00

2 Answers


I figured out a workaround:

While far from ideal, it's an acceptable compromise in my particular case.

ReadWriteOnce volumes apparently don't play well with the default Deployment upgrade strategy, RollingUpdate (even for single-replica deployments). If I instead use the Recreate strategy, Kubernetes destroys the original pod before starting its successor, so the volume is detached before it is mounted again.

...
spec:
  selector:
    matchLabels:
      app: test-cache-deployment
  replicas: 1
  strategy:
    type: Recreate
...

This solution obviously comes with a major drawback: the deployment will be offline from shutdown until successful startup of the successor, which might take anywhere from a few seconds to eternity.

John-Arne Boge

This is the expected behavior if you use ReadWriteOnce. If you look into the documentation you will find:

  • ReadWriteOnce – the volume can be mounted as read-write by a single node

The Kubernetes documentation has a table that shows which volume plugins support ReadWriteMany (that is, read-write access from multiple nodes at the same time, e.g. NFS).
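
For illustration, a minimal sketch of a ReadWriteMany claim. The class name nfs-client is only a placeholder for whatever RWX-capable provisioner you have available (DigitalOcean block storage is not one of them):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vol-test-rwx
spec:
  accessModes:
  - ReadWriteMany                # read-write from many nodes at the same time
  resources:
    requests:
      storage: 10Gi
  storageClassName: nfs-client   # placeholder: any RWX-capable provisioner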

If you still want to use ReadWriteOnce, you can use node affinity and make sure that both replicas are scheduled to the same node, but this is considered bad practice, as it defeats part of the point of Kubernetes: if that particular node goes down, all your replicas go down with it.

The desired state you mentioned in the comment could be achieved with pod affinity:

Pod affinity and pod anti-affinity allow you to specify rules about how pods should be placed relative to other pods. The rules are defined using custom labels on nodes and label selectors specified in pods. Pod affinity/anti-affinity allows a pod to specify an affinity (or anti-affinity) towards a group of pods it can be placed with. The node does not have control over the placement.

Check this example from the Kubernetes documentation about pod affinity.
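
As a rough sketch (the preferred rule and weight here are my own assumptions, adjust to your labels), the Deployment above could ask to be co-located with any pod that already carries the app: test-cache-deployment label:

...
spec:
  template:
    spec:
      affinity:
        podAffinity:
          # Prefer (not require) co-location with an existing pod of the same
          # app, so the very first pod can still be scheduled anywhere.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: test-cache-deployment
              topologyKey: kubernetes.io/hostname
...

Keep in mind this is only a preference: the scheduler may still pick a different node, in which case the Multi-Attach error can reappear.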

acid_fuji
  • I guess NodeAffinity could work. Especially since my example requires only 1 replica in the first place. However, I don't really want to identify my nodes - they are ephemeral and I don't care if they live or die. Is there a way to use NodeAffinity without identifying one particular node? As in "start the upgraded pod on the same node as the original pod (if it's still there)"? – John-Arne Boge Jun 06 '20 at 10:25
  • @John-ArneBoge This can be achieved with pod affinity. I have edited my answer. – acid_fuji Jun 09 '20 at 08:50