I have two pods: one writes files to a persistent volume, and the other is supposed to read those files to make some calculations.

The first pod writes the files successfully, and when I display the contents of the persistent volume with `print(os.listdir(persistent_volume_path))` I see all the expected files. However, the same command in the second pod shows an empty directory. (The mountPath directory /data is created but empty.)

This is the TFJob yaml file:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: pod1
  namespace: my-namespace
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: my-image:latest
              imagePullPolicy: Always
              command:
                - "python"
                - "./program1.py"
                - "--data_path=./dataset.csv"
                - "--persistent_volume_path=/data"
              volumeMounts:
                - mountPath: "/data"
                  name: my-pv
          volumes:
            - name: my-pv
              persistentVolumeClaim:
                claimName: my-pvc

(The manifest for the second pod is identical, with pod2 and program2.py respectively.)

And this is the volume configuration:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
  namespace: my-namespace
  labels:
    type: local
    app: tfjob
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

---

apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
  labels:
    type: local
    app: tfjob
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/data"

Does anyone have any idea where exactly the problem is and how to fix it?

camelia
  • Are the pods running on the same node? The volume can only be mounted on one node at a time. What is the status of pod2 if you do `kubectl describe`? – Jonas Aug 25 '21 at 20:01
  • @Jonas When I ran `kubectl get pods -o wide`, I found that the two pods are not running on the same node. So your assumption is correct, thank you! How can I change one of them to run on the same node as the other one? – camelia Aug 25 '21 at 20:16
  • You need to add some form of pod affinity https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ and then create new pods with that configuration. – Jonas Aug 25 '21 at 20:20

1 Answer

When two pods need concurrent access to a shared PersistentVolume with access mode ReadWriteOnce, they must run on the same node, because a volume with that access mode can only be mounted on a single node at a time.

To achieve this, some form of Pod Affinity must be applied so that both pods are scheduled onto the same node.
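
For example, here is a minimal sketch of what pod2's TFJob could look like with a required pod affinity rule. The label job: pod1 is an assumption: the original manifests only label the PV and PVC, so pod1's pod template would need that label added, and pod1's pod must already be running when pod2 is scheduled:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: pod2
  namespace: my-namespace
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          # Require scheduling onto the node that already runs a pod
          # labeled job=pod1 (hypothetical label; add it to pod1's template).
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      job: pod1
                  topologyKey: kubernetes.io/hostname
          containers:
            - name: tensorflow
              image: my-image:latest
              imagePullPolicy: Always
              command:
                - "python"
                - "./program2.py"
                - "--data_path=./dataset.csv"
                - "--persistent_volume_path=/data"
              volumeMounts:
                - mountPath: "/data"
                  name: my-pv
          volumes:
            - name: my-pv
              persistentVolumeClaim:
                claimName: my-pvc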

Jonas
  • For some reason, deleting and applying the tfjob again after adding a `nodeSelector` doesn't change the node. But it works for newly created pods. Is there a way to restart the tfjob or force the changes instead of creating a new one? – camelia Aug 26 '21 at 08:26
  • The pod will be recreated anyway when moving from one node to another. If `nodeSelector` doesn't work, check whether it is set up correctly. – moonkotte Aug 26 '21 at 11:25
  • It didn't work at first with `nodeSelector`, but when I tried `nodeAffinity` with a required node affinity rule, it worked like a charm! Thanks @moonkotte and @Jonas for your help! – camelia Aug 26 '21 at 19:29
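
For reference, a minimal sketch of the required node affinity rule camelia describes, placed under the pod template's spec; my-node is a placeholder for the name of the node where the first pod is running:

          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - my-node   # placeholder: the node running pod1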