
We have a blue-green deployment system that we have been using for quite a while. There are two backend services on the load balancer: one for test and one for production, each backed by a different GKE node pool through its own instance group. To deploy a new version to production, we simply change the instance group on the production backend service. This worked fine until June 2019. Now, after switching the instance group, the backend service is unavailable for a short period (about 2-3 minutes) and the load balancer responds with 502 errors.
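
For concreteness, here is a minimal sketch of the switch described above, assuming it is done programmatically with the google-cloud-compute Python client rather than through the console; the project, backend service, and instance group names are placeholders, not our actual setup.

```python
# Minimal sketch of the blue-green switch: point the production
# backend service at the new instance group. All names below are
# hypothetical placeholders.
from google.cloud import compute_v1

PROJECT = "my-project"
BACKEND_SERVICE = "prod-backend"
# Backends reference instance groups by their full URL.
NEW_GROUP = (
    "https://www.googleapis.com/compute/v1/projects/my-project"
    "/zones/us-central1-a/instanceGroups/green-pool"
)

client = compute_v1.BackendServicesClient()

# Fetch the current backend service and replace its backend with the
# new instance group. This single swap is the step that started
# producing ~2-3 minutes of 502s after June 2019.
service = client.get(project=PROJECT, backend_service=BACKEND_SERVICE)
service.backends = [compute_v1.Backend(group=NEW_GROUP)]

client.patch(
    project=PROJECT,
    backend_service=BACKEND_SERVICE,
    backend_service_resource=service,
).result()  # block until the patch operation completes
```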

I've also created an issue in the GCP issue tracker that includes screenshots and steps to reproduce: https://issuetracker.google.com/issues/136020917


2 Answers


This is expected behavior. Changes to your backend services are not instantaneous; it can take several minutes for them to propagate throughout the network.

The best practice is to create the new instance group before making any other changes, wait for it to become healthy, and verify that traffic is flowing through it. Only after that should the old instance group be removed, as sketched below.
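
A sketch of that add-first workflow, again assuming the google-cloud-compute Python client; all resource names are hypothetical.

```python
# Attach the new instance group alongside the old one, verify, then
# remove the old group in a separate step. Names are hypothetical.
from google.cloud import compute_v1

PROJECT = "my-project"
BACKEND_SERVICE = "prod-backend"
# Backends reference instance groups by their full URL.
OLD_GROUP = (
    "https://www.googleapis.com/compute/v1/projects/my-project"
    "/zones/us-central1-a/instanceGroups/blue-pool"
)
NEW_GROUP = (
    "https://www.googleapis.com/compute/v1/projects/my-project"
    "/zones/us-central1-a/instanceGroups/green-pool"
)

client = compute_v1.BackendServicesClient()

def set_backends(groups):
    """Replace the service's backend list with the given instance groups."""
    service = client.get(project=PROJECT, backend_service=BACKEND_SERVICE)
    service.backends = [compute_v1.Backend(group=g) for g in groups]
    client.patch(
        project=PROJECT,
        backend_service=BACKEND_SERVICE,
        backend_service_resource=service,
    ).result()

# Step 1: serve from both groups while the new one warms up.
set_backends([OLD_GROUP, NEW_GROUP])

# ... wait for health checks to pass and verify traffic flow ...

# Step 2: drop the old group. (Per the comment below, this removal
# can still cause a brief window of 502s.)
set_backends([NEW_GROUP])
```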

  • But the instance group is healthy. I described it in the linked issue, but I've also tried adding 2 instance groups to the backend (both healthy), and traffic flowed without any problems. But when I removed the old instance group, I got another 2-3 minutes of downtime. – dimka Dec 12 '19 at 04:04
  • And what would be the alternative then? What I'm trying to do is switch between two GKE clusters on the load balancer. I can't swap backend services, because then the Google Cloud CDN cache would be wiped. The load balancer is not part of the GKE setup. – dimka Dec 12 '19 at 05:17

This is intended behaviour: per the GCP documentation, changes made to a load balancer's backend services can leave the backends inaccessible through the load balancer for a few minutes.

That being said, I would recommend following the documentation below [1] for performing rolling updates in GKE. Using rolling updates will eliminate the downtime; a sketch follows the link.

[1] Performing rolling updates: https://cloud.google.com/kubernetes-engine/docs/how-to/updating-apps
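
For reference, a minimal sketch of triggering such a rolling update with the official kubernetes Python client; the deployment name, namespace, and image are hypothetical.

```python
# Trigger a Deployment rolling update by patching the pod template
# image; the Deployment controller replaces pods gradually (honoring
# maxSurge/maxUnavailable), so the Service keeps healthy endpoints
# throughout. All names are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "web", "image": "gcr.io/my-project/web:v2"}
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="web", namespace="default", body=patch)
```

Note that this replaces pods within a single cluster; as the comment below points out, it does not cover switching between clusters.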

  • Thanks for the link. We are using rolling updates. The setup mentioned in the question is for cluster updates, for example upgrading the k8s master version. So what we really want to do is switch between GKE clusters. Also, we can't use separate backend services on the LB because we want to preserve the CDN cache. What surprises me is that we used to do these updates without any downtime, and then it stopped working at some point. – dimka Dec 13 '19 at 04:22