All of our AKS clusters have the following error reported in Azure Portal:

This container service is in a failed state. Click here to open a new support request.

It seems we also cannot edit the cluster. When trying to scale out the nodes, I am getting the following error:

Failed to save container service 'test-aks'. Error: Operation is not allowed while cluster is being upgrading or failed in upgrade

When looking into the AKS properties, I see a provisioning state of "Failed".

We don't know how to troubleshoot this problem.

Dave New
  • I'd contact support and go to #sig-azure on k8s slack – 4c74356b41 Feb 11 '19 at 13:56
  • Did you make any changes to your cluster recently, such as upgrading to another version? – Karishma Tiwari - MSFT Feb 12 '19 at 18:39
  • Use the az aks scale command to scale the cluster nodes using Azure CLI as described here and share the results: https://learn.microsoft.com/en-us/azure/aks/scale-cluster#scale-the-cluster-nodes It is likely that you exceeded the core quota. Let me know. – Karishma Tiwari - MSFT Feb 12 '19 at 18:45
  • Any more questions? If the answer was helpful, you can accept it. – Charles Xu Feb 13 '19 at 09:19
  • It was because I submitted an update request to the cluster, but there were no vCPUs available in my subscription. This set the provisioning state of the update to "Failed", with no reason given. I had to increase my quota and rerun the update command. – Dave New Feb 13 '19 at 09:27

2 Answers

Use the az aks scale command to scale the cluster nodes using Azure CLI as described here: https://learn.microsoft.com/en-us/azure/aks/scale-cluster#scale-the-cluster-nodes

az aks show --resource-group myResourceGroup --name myAKSCluster --query agentPoolProfiles

The Azure CLI will show you a descriptive error message. You have most likely exceeded your core (vCPU) quota. More details are discussed in this thread: https://github.com/Azure/AKS/issues/542
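As a rough sketch of how the checks might be run from the CLI: the resource group, cluster name, region, and node count below are placeholders, and `run` is a hypothetical helper that only prints each command unless RUN=1 is set. It reads the cluster's provisioning state, lists compute usage against quota for the region, and then retries the scale once the quota has been raised.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions); replace them with your own.
RESOURCE_GROUP="myResourceGroup"
CLUSTER_NAME="myAKSCluster"
LOCATION="westeurope"

# Hypothetical helper: prints the command by default, runs it when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# 1. Read the cluster's provisioning state ("Failed" in the question).
run az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
  --query provisioningState --output tsv

# 2. List compute usage against quota for the region; check the vCPU rows.
run az vm list-usage --location "$LOCATION" --output table

# 3. After the quota has been raised, retry the scale operation.
run az aks scale --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
  --node-count 3
```

Running it with RUN=1 requires a logged-in Azure CLI session. If the usage table shows the regional or VM-family vCPU count at its limit, that would be consistent with the quota explanation above.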

For the error you show:

This container service is in a failed state. Click here to open a new support request.

This also happened to me. Usually, a subscription has limits on the resources it can use. On my side, I can only use 10 vCPUs, so I got the error when I scaled up to more nodes and there were no vCPUs left. I think that is a possible cause in your case too, so it is worth checking.
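The arithmetic behind that limit can be sketched as follows. The function names and the sample figures (8 vCPUs in use, 2 vCPUs per node) are assumptions for illustration; only the 10 vCPU limit comes from the case above.

```shell
#!/usr/bin/env bash
# Succeeds (exit status 0) when the extra nodes fit within the vCPU quota.
# Arguments: vCPUs in use, vCPU quota, nodes to add, vCPUs per node.
fits_in_quota() {
  local used=$1 limit=$2 extra_nodes=$3 vcpus_per_node=$4
  (( used + extra_nodes * vcpus_per_node <= limit ))
}

# Prints a one-line verdict for the same four arguments.
check_scale() {
  if fits_in_quota "$@"; then
    echo "fits"
  else
    echo "exceeds quota"
  fi
}

# Assumed figures: quota of 10 vCPUs, 8 already in use, 2 vCPUs per node.
check_scale 8 10 1 2   # one more node needs 8 + 2 = 10 vCPUs
check_scale 8 10 2 2   # two more nodes need 8 + 4 = 12 vCPUs
```

With these numbers a single extra node still fits, while a second one pushes the request past the limit, which is the situation where the scale fails.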

Charles Xu
  • It was because I submitted an update request to the cluster, but there were no vCPUs available in my subscription. This set the provisioning state of the update to "Failed", with no reason given. I had to increase my quota and rerun the update command. Thanks – Dave New Feb 13 '19 at 09:27
  • @davenewza You mean you increased the quota and it works now? It may also be a limit on some other resource. You can get more details from the log. – Charles Xu Feb 13 '19 at 09:34