I inherited an AKS cluster running in Switzerland North. This region doesn't offer ZRS managed disks, only LRS. Switching to ReadWriteMany (Azure Files) is not an option.
I have one system node pool spanning all three availability zones, a custom storage class that allows for dynamic block storage provisioning, and a StatefulSet defining a persistent volume claim template. The storage class looks like this:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: my-block-sc
parameters:
  cachingmode: ReadOnly
  diskEncryptionSetID: ...
  diskEncryptionType: EncryptionAtRestWithCustomerKey
  networkAccessPolicy: DenyAll
  skuName: StandardSSD_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
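For context, the StatefulSet is shaped roughly like this (names, image, and sizes are placeholders, not the real workload):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                # placeholder name
spec:
  serviceName: my-app
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: my-block-sc
        resources:
          requests:
            storage: 32Gi                    # placeholder size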
Now, from time to time, pods get stuck in the Pending state. This happens because the default scheduler tries to place a pod on a node that is not in the same zone as its PV (an LRS disk is pinned to a single zone), and the volume node affinity conflict leaves the pod unschedulable.
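To illustrate why the pod is zone-pinned: the dynamically provisioned PV ends up with a node affinity like the following (trimmed; the exact topology key and zone value depend on the CSI driver version):

apiVersion: v1
kind: PersistentVolume
spec:
  # ... other fields omitted ...
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone   # or topology.kubernetes.io/zone
              operator: In
              values:
                - switzerlandnorth-1                  # the zone the LRS disk lives in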
Of course, I could configure a node affinity and bind all pods to a single zone, as sketched below. But then I lose HA, since the pods are no longer spread across zones.
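This is the kind of pod-template affinity I want to avoid, because it pins every replica to one zone (zone value is just an example):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - switzerlandnorth-1   # example zone; ties all replicas to this zone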
So, how can I configure the StatefulSet so that, after a crash or restart, each pod is scheduled again in the zone where its PV lives?
Is there some dynamic way of providing a node affinity to the pod template spec, per pod?