I have a GKE cluster (1.12.10-gke.17).
I'm running the nginx-ingress-controller with `type: LoadBalancer`. I've set `externalTrafficPolicy: Local` to preserve the source IP.

Everything works great, except during rolling updates. I have `maxSurge: 1` and `maxUnavailable: 0`.
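
For reference, the relevant pieces of the Service and Deployment look roughly like this (names, labels, image version, and ports are placeholders, not my exact manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve the client source IP
  selector:
    app: ingress-nginx
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
  namespace: ingress-nginx
spec:
  replicas: 1                    # placeholder
  selector:
    matchLabels:
      app: ingress-nginx
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring the new pod up first
      maxUnavailable: 0  # never go below the desired replica count
  template:
    metadata:
      labels:
        app: ingress-nginx
    spec:
      containers:
        - name: nginx-ingress-controller
          image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.26.1  # placeholder
          ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443
```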
My problem is that during a rolling update, I start getting request timeouts. I suspect the Google load balancer is still sending requests to the node where the pod is `Terminating`, even though the health checks are failing. This happens for about 30-60s, starting right when the pod changes from `Running` to `Terminating`. Everything stabilizes after a while and traffic eventually goes only to the new node with the new pod.
If the load balancer is slow to stop sending requests to a terminating pod, is there some way to make these rolling deploys hitless?
My understanding is that with a normal k8s service, where `externalTrafficPolicy` is left at the default (`Cluster`), the Google load balancer simply sends requests to all nodes and lets the iptables rules sort it out. When a pod is `Terminating`, the iptables rules are updated quickly and traffic stops being sent to that pod. When `externalTrafficPolicy` is `Local`, however, if the node that receives the request does not have a `Running` pod, the request times out, which is what is happening here.
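
As far as I can tell, the way the load balancer decides which nodes have a local endpoint is the `healthCheckNodePort` that kube-proxy serves on every node. Hitting that port on a node reports the number of local Ready endpoints, something like this (example values; the Service name here is a placeholder):

```json
{
  "service": {
    "namespace": "ingress-nginx",
    "name": "ingress-nginx"
  },
  "localEndpoints": 0
}
```

kube-proxy answers with a 503 when `localEndpoints` is 0, so the node fails the load balancer's health check.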
If this is correct, then I only see two options:

1. stop sending requests to the node with a `Terminating` pod
2. continue servicing requests even though the pod is `Terminating`
I feel like option 1 is difficult, since it requires informing the load balancer that the pod is about to start `Terminating`.
I've made some progress on option 2, but so far haven't gotten it working. I've managed to continue serving requests from the pod by adding a `preStop` lifecycle hook which just runs `sleep 60`, but I think the problem is that the `healthCheckNodePort` reports `localEndpoints: 0`, and I suspect something is blocking the request between arriving at the node and getting to the pod. Perhaps the iptables rules aren't routing when `localEndpoints: 0`.
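
The hook itself is essentially this (a sketch of the relevant part of the Deployment; the container name is a placeholder, and the bumped grace period is my assumption about what the hook needs so it isn't killed early):

```yaml
spec:
  template:
    spec:
      # Assumption: the grace period must exceed the preStop sleep, otherwise
      # the pod is SIGKILLed before the hook finishes (the default is 30s).
      terminationGracePeriodSeconds: 90
      containers:
        - name: nginx-ingress-controller   # placeholder name
          lifecycle:
            preStop:
              exec:
                # Keep the old pod around (and nginx serving) for a while
                # after the rolling update marks it Terminating.
                command: ["/bin/sh", "-c", "sleep 60"]
```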
I've also adjusted the Google load balancer health check, which is separate from the `readinessProbe` and `livenessProbe`, to the fastest settings possible (1s interval, failure threshold of 1). I've verified that the load balancer backend, i.e. the k8s node, does indeed fail its health checks quickly, but the load balancer continues to send requests to the terminating pod anyway.