
I upgraded my Google Cloud Kubernetes (GKE) cluster from 1.16.9-gke.2 to 1.16.9-gke.6 and the upgrade failed.

All cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-3a1fa906f95b728d035e-59bc" is unhealthy.

I'm hitting this issue constantly. My production cluster is completely inaccessible and there is no way to roll back. I had the same issue during the previous upgrade; it resolved itself after 3-4 attempts. This time I have retried the upgrade 6-7 times with no luck.

Master of cluster [flow-env-prod] will be upgraded from version [1.16.9-gke.2] to version [1.16.9-gke.6]. This operation is long-running and will block other operations on the cluster (including delete) until it has run to completion.
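For context, this is roughly what I ran via the CLI; a minimal sketch, assuming a zonal cluster (the cluster name and zone are taken from the operation log below):

    # upgrade only the control plane (master) to the target patch version
    gcloud container clusters upgrade flow-env-prod \
        --master \
        --cluster-version 1.16.9-gke.6 \
        --zone europe-west1-b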

Some logs from the failed upgrade:

Upgrading flow-env-prod...done.                                                                                                                                                                                                                
ERROR: (gcloud.container.clusters.upgrade) Operation [<Operation
 clusterConditions: [<StatusCondition
 message: u'All cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-3a1fa906f95b728d035e-59bc" is unhealthy.'>]
 detail: u'All cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-3a1fa906f95b728d035e-59bc" is unhealthy.'
 endTime: u'2020-06-23T02:09:20.859081264Z'
 name: u'operation-1592876809477-ed87ffb3'
 nodepoolConditions: []
 operationType: OperationTypeValueValuesEnum(UPGRADE_MASTER, 3)
 selfLink: u'https://container.googleapis.com/v1/projects/[projectID]/zones/europe-west1-b/operations/operation-1592876809477-ed87ffb3'
 startTime: u'2020-06-23T01:46:49.477981885Z'
 status: StatusValueValuesEnum(DONE, 3)
 statusMessage: u'All cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-3a1fa906f95b728d035e-59bc" is unhealthy.'
 targetLink: u'https://container.googleapis.com/v1/projects/[projectID]/zones/europe-west1-b/clusters/flow-env-prod'
 zone: u'europe-west1-b'>] finished with error: All cluster resources were brought up, but: component "kube-apiserver" from endpoint "gke-3a1fa906f95b728d035e-59bc" is unhealthy.
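In case it helps with debugging, here is a minimal sketch of how to inspect the failed operation and the master's actual state afterwards, assuming the operation ID and zone from the log above:

    # show the full status and error message of the failed upgrade operation
    gcloud container operations describe operation-1592876809477-ed87ffb3 \
        --zone europe-west1-b

    # confirm which version the master is actually running and the cluster status
    gcloud container clusters describe flow-env-prod \
        --zone europe-west1-b \
        --format="value(currentMasterVersion,status)"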
  • Could you provide the exact steps you used to upgrade this cluster? Is this issue still ongoing? How long does it take? – PjoterS Jun 23 '20 at 10:17
  • I just used the UI button and clicked "upgrade master". I also did the same via the CLI. Cluster access was restored after I reported it to support, though the master version is still the same. I'm going to try the upgrade again later on. – Maksym D. Jun 23 '20 at 14:53
  • Cool. Please update your question with further details. Also, are you using any special features or preemptible nodes? – PjoterS Jun 23 '20 at 16:11
  • No preemptible nodes, nothing very special. I do have a related DNS issue: DNS requests have been failing constantly all day, both external and internal. It started occurring after the failed upgrade (see the DNS check sketch below these comments). – Maksym D. Jun 24 '20 at 02:10
  • I am not able to reproduce this behaviour on my cluster. Do you have any specific GKE features enabled, such as HPA or CA, or are you using any webhook like in [this example](https://blog.jetstack.io/blog/gke-webhook-outage/)? How did you create this cluster: with Terraform, the UI, or the CLI? Are you still able to reproduce this behaviour? Did you try creating a new cluster and upgrading it, and does the same issue occur? – PjoterS Jun 30 '20 at 10:34
  • No, our setup is a plain default. The issue was investigated by the GKE core team. Thanks for your support. – Maksym D. Sep 29 '20 at 03:07
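Regarding the DNS failures mentioned in the comments above, a minimal sketch of basic in-cluster DNS health checks, assuming the default GKE kube-dns deployment (the dns-test pod name and busybox image are only illustrative):

    # check that the kube-dns pods are running and scan their recent logs
    kubectl -n kube-system get pods -l k8s-app=kube-dns
    kubectl -n kube-system logs -l k8s-app=kube-dns --all-containers --tail=50

    # test internal resolution from a throwaway pod (swap in an external name
    # such as google.com to compare external resolution)
    kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
        -- nslookup kubernetes.default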

0 Answers