I set up a (very) simple Deployment on a GKE Autopilot cluster running the latest version of Kubernetes (1.18.15-gke.1501) and attached an Ingress (external HTTP(S) load balancer) that points to a simple ClusterIP service.
Whenever I update the deployment with a new image, I get about 5-15 minutes of downtime during which the load balancer returns 502 errors. As far as I can tell, the control plane creates the new, updated pod and lets the service-level health checks pass (but not the load-balancer ones, since it hasn't created the NEG yet), then kills the older pod while it is still setting up the new NEG. The old NEG then isn't removed until a variable amount of time later.
Logs on the pods show that health checks are passing, but the GKE dashboard shows inconsistent results for the Ingress's state: the Ingress reports as fine while the service returns 502s.
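For reference, my (possibly wrong) understanding of how this should work: with maxUnavailable set to 0 and the NEG readiness gate in place, the old pod shouldn't be taken down until the new pod's gate condition flips to True, i.e. until its status contains something roughly like this (approximate shape, not an actual dump from my cluster):
status:
  conditions:
    # Set by the NEG controller once the pod has been attached to the NEG
    # and is passing the load balancer health check.
    - type: "cloud.google.com/load-balancer-neg-ready"
      status: "True"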
Things I've tried
- Increasing the number of pods from 1 to 3. This helped on some deploys, but on others it actually increased the time it took for the load balancer to resolve correctly.
- Setting maxSurge to 1 and maxUnavailable to 0. This did not improve the downtime at all.
- Adding lifecycle.preStop.exec.command: ["sleep", "60"] to the container on the deployment, as suggested in the GKE docs here.
- Recreating the ingress, service, deployments, and clusters multiple times.
- Adding a BackendConfig to the service that enables slower connection draining.
- Adding the readiness gate from the docs that's supposed to fix this, but for some reason doesn't.
None of the above have helped or made any noticeable difference in how long things were down.
I'm really, really confused about why this isn't working. It feels like I'm missing something really obvious, but this is also such a simple config that you'd think it'd... just work? Does anyone have any idea what's going on?
Config files
Deployment config:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foundation-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: foundation-web
  template:
    metadata:
      labels:
        app: foundation-web
    spec:
      readinessGates:
        - conditionType: "cloud.google.com/load-balancer-neg-ready"
      serviceAccountName: foundation-database-account
      containers:
        # Run Cloud SQL proxy so we can safely connect to Postgres on localhost.
        - name: cloud-sql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.17
          resources:
            requests:
              cpu: "250m"
              memory: 100Mi
            limits:
              cpu: "500m"
              memory: 100Mi
          command:
            - "/cloud_sql_proxy"
            - "-instances=nine-foundation:us-central1:foundation-staging=tcp:5432"
          securityContext:
            runAsNonRoot: true
        # Main container config
        - name: foundation-web
          image: gcr.io/project-name/foundation_web:latest
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "60"]
          env:
            # Env variables
          resources:
            requests:
              memory: "500Mi"
              cpu: "500m"
            limits:
              memory: "1000Mi"
              cpu: "1"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          ports:
            - containerPort: 4000
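One detail I'm unsure about with the preStop hook: the default terminationGracePeriodSeconds is 30 seconds, so a 60-second sleep would presumably get cut short unless the grace period is raised as well. Something like this fragment of the pod spec is what I have in mind (the 90 is an arbitrary value I picked for illustration, not something in my current config):
spec:
  template:
    spec:
      # Give the 60s preStop sleep room to finish before the kubelet force-kills the pod
      # (the default grace period is 30s).
      terminationGracePeriodSeconds: 90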
Service config:
apiVersion: v1
kind: Service
metadata:
  name: foundation-web-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/backend-config: '{"ports": {"4000": "foundation-service-config"}}'
spec:
  type: ClusterIP
  selector:
    app: foundation-web
  ports:
    - port: 4000
      targetPort: 4000
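For what it's worth, my understanding is that once the NEG controller has actually created the NEG, it writes a cloud.google.com/neg-status annotation back onto the Service, which seems like the easiest way to see when the NEG exists. It ends up looking roughly like this (the NEG name and zone below are made-up placeholders, not values from my cluster):
apiVersion: v1
kind: Service
metadata:
  name: foundation-web-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    # Added by the NEG controller after it creates the NEG (placeholder values):
    cloud.google.com/neg-status: '{"network_endpoint_groups":{"4000":"k8s1-placeholder-default-foundation-web-service-4000-placeholder"},"zones":["us-central1-a"]}'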
BackendConfig:
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: foundation-service-config
spec:
  # sessionAffinity:
  #   affinityType: "GENERATED_COOKIE"
  #   affinityCookieTtlSec: 120
  connectionDraining:
    drainingTimeoutSec: 60
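In case it's relevant: as far as I can tell from the BackendConfig docs, it can also pin the load balancer health check explicitly instead of having it inferred from the readiness probe. A rough sketch of what I believe that would look like for my /healthz endpoint (the healthCheck block is my guess, not something I'm currently running):
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: foundation-service-config
spec:
  connectionDraining:
    drainingTimeoutSec: 60
  # Sketch: point the load balancer health check at the app's /healthz endpoint directly.
  healthCheck:
    type: HTTP
    requestPath: /healthz
    port: 4000
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2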
Ingress config:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: foundation-web-ingress
  labels:
    name: foundation-web-ingress
spec:
  backend:
    serviceName: foundation-web-service
    servicePort: 4000
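(I know networking.k8s.io/v1beta1 is deprecated; it's still what 1.18 serves, but in case the API version matters, my understanding is that the networking.k8s.io/v1 equivalent once the cluster is on 1.19+ would look roughly like this:)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: foundation-web-ingress
  labels:
    name: foundation-web-ingress
spec:
  defaultBackend:
    service:
      name: foundation-web-service
      port:
        number: 4000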