I inherited an AKS cluster running in Switzerland North. This region doesn't offer ZRS managed disks, only LRS. Switching to ReadWriteMany (Azure Files) is not an option.
I have one system node pool spanning all three availability zones, a custom storage class that allows for dynamic block storage provisioning, and a StatefulSet defining a persistent volume claim template. The storage class looks like this:
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: my-block-sc
parameters:
  cachingmode: ReadOnly
  diskEncryptionSetID: ...
  diskEncryptionType: EncryptionAtRestWithCustomerKey
  networkAccessPolicy: DenyAll
  skuName: StandardSSD_LRS
provisioner: disk.csi.azure.com
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
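For context, the StatefulSet is shaped roughly like this (names, image, and sizes are placeholders, not the real workload):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                # placeholder name
spec:
  serviceName: my-app
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-registry/my-app:latest   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: my-block-sc
        resources:
          requests:
            storage: 32Gi                    # placeholder size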
Now, from time to time, pods get stuck in the Pending state. This happens because the default scheduler tries to place a pod on a node that is not in the same zone as its PV (an LRS disk is pinned to a single zone), and the volume node affinity conflict leaves the pod unschedulable.
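To illustrate why the pod is zone-pinned: the dynamically provisioned PV ends up with a node affinity like the following (trimmed; the exact topology key and zone value depend on the CSI driver version):

apiVersion: v1
kind: PersistentVolume
spec:
  # ... other fields omitted ...
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.disk.csi.azure.com/zone   # or topology.kubernetes.io/zone
              operator: In
              values:
                - switzerlandnorth-1                  # the zone the LRS disk lives in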
Of course, I could configure a node affinity and bind all pods to a single zone, as sketched below. But then I lose HA, since the pods are no longer spread across zones.
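This is the kind of pod-template affinity I want to avoid, because it pins every replica to one zone (zone value is just an example):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - switzerlandnorth-1   # example zone; ties all replicas to this zone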
So, how can I configure the StatefulSet so that, after a crash or restart, each pod is scheduled again in the zone where its PV lives?
Is there some dynamic way of providing a node affinity to the pod template spec, per pod?