I was asked to post this here by Azure's Twitter Support (instead of on ServerFault.com).
Our Kubernetes environment, running on Azure Container Service, had been working wonderfully for over a week without needing any changes, with 24 VHDs attached.
Then we suddenly received alerts that all services had stopped working. Every pod using a Persistent Volume Claim was stuck in ContainerCreating. A quick kubectl describe pod podname shows:
Unable to mount volumes for pod "***-1370023040-st581_default(9b050936-1baa-11e7-9b77-000d3ab513dc)": timeout expired waiting for volumes to attach/mount for pod "default"/"***-1370023040-st581". list of unattached/unmounted volumes=[***-persistent-storage]
and
Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "default"/"***-1370023040-st581". list of unattached/unmounted volumes=[***-persistent-storage]
on all of the pods.
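For reference, the checks were along these lines (pod name redacted as above, namespace is default as in the errors):

    # List pods stuck in ContainerCreating
    kubectl get pods --all-namespaces | grep ContainerCreating

    # Inspect the mount events on one of the affected pods
    kubectl describe pod ***-1370023040-st581 --namespace default

    # Check the Persistent Volume Claims backing those pods
    kubectl get pvc --namespace default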
In the Azure Portal I can see that the agent VM has only the agent OS VHD attached as a disk. Manual attempts to re-attach the data disks fail with:
Failed to update disks for the virtual machine 'k8s-agent-CD93CDEA-0'. Error: A disk named '***mgmt-dynamic-pvc-018bdc6e-161a-11e7-8ca8-000d3ab513dc.vhd' already uses the same VHD URL …https://***.blob.core.windows.net/vhds/***mgmt-dynamic-pvc-018bdc6e-161a-11e7-8ca8-000d3ab513dc.vhd ….
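For what it's worth, the equivalent check and attach attempt from the Azure CLI would look roughly like the sketch below (the resource group name is a placeholder, and I'm assuming az vm unmanaged-disk attach is the right command for these unmanaged VHD blobs):

    # List the data disks Azure reports as attached to the agent VM
    az vm show --resource-group <resource-group> --name k8s-agent-CD93CDEA-0 \
        --query "storageProfile.dataDisks" --output table

    # Sketch of re-attaching one of the existing PVC VHDs to the agent
    az vm unmanaged-disk attach --resource-group <resource-group> \
        --vm-name k8s-agent-CD93CDEA-0 \
        --name ***mgmt-dynamic-pvc-018bdc6e-161a-11e7-8ca8-000d3ab513dc \
        --vhd-uri "https://***.blob.core.windows.net/vhds/***mgmt-dynamic-pvc-018bdc6e-161a-11e7-8ca8-000d3ab513dc.vhd"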
Restarting the agent/master also doesn't clear the problem.
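For completeness, the restarts amount to something like the following (resource group is again a placeholder):

    # Restart the agent VM (the master was restarted the same way)
    az vm restart --resource-group <resource-group> --name k8s-agent-CD93CDEA-0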
We are using an F16s for the agent, which supports 32 data disks.
How do you reattach the VHDs to get going again?