I set up a (very) simple Deployment on a GKE Autopilot cluster running the latest version of Kubernetes (1.18.15-gke.1501) and attached an Ingress (external HTTP(S) load balancer) that points to a simple ClusterIP service.
Whenever I update the deployment with a new image, I get about 5-15 minutes of downtime during which the load balancer returns 502 errors. As far as I can tell, the control plane creates the new, updated pod and lets the service-level health checks pass (but not the load-balancer ones, since it hasn't created the NEG yet), then kills the older pod while it is still setting up the new NEG. The old NEG then isn't removed until a variable amount of time later.
Logs on the pods show that health checks are passing, but the GKE dashboard shows inconsistent results for the Ingress's state: the Ingress reports as fine while the service returns 502s.
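For reference, my (possibly wrong) understanding of how this should work: with maxUnavailable set to 0 and the NEG readiness gate in place, the old pod shouldn't be taken down until the new pod's gate condition flips to True, i.e. until its status contains something roughly like this (approximate shape, not an actual dump from my cluster):
status:
  conditions:
    # Set by the NEG controller once the pod has been attached to the NEG
    # and is passing the load balancer health check.
    - type: "cloud.google.com/load-balancer-neg-ready"
      status: "True"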
Things I've tried
- Increasing the number of pods from 1 to 3. This helped on some deploys, but on others it actually increased the time it took for the load balancer to resolve correctly.
- Setting maxSurge to 1 and maxUnavailable to 0. This did not improve the downtime at all.
- Adding lifecycle.preStop.exec.command: ["sleep", "60"] to the container on the deployment, as suggested in the GKE docs here.
- Recreating the ingress, service, deployments, and clusters multiple times.
- Adding a BackendConfig to the service that enables slower connection draining.
- Adding the readiness gate from the docs that's supposed to fix this, but for some reason doesn't.
None of the above have helped or made any noticeable difference in how long things were down.
I'm really, really confused about why this isn't working. It feels like I'm missing something really obvious, but this is also such a simple config that you'd think it'd... just work? Does anyone have any idea what's going on?
Config files
Deployment config:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: foundation-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: foundation-web
  template:
    metadata:
      labels:
        app: foundation-web
    spec:
      readinessGates:
        - conditionType: "cloud.google.com/load-balancer-neg-ready"
      serviceAccountName: foundation-database-account
      containers:
        # Run Cloud SQL proxy so we can safely connect to Postgres on localhost.
        - name: cloud-sql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.17
          resources:
            requests:
              cpu: "250m"
              memory: 100Mi
            limits:
              cpu: "500m"
              memory: 100Mi
          command:
            - "/cloud_sql_proxy"
            - "-instances=nine-foundation:us-central1:foundation-staging=tcp:5432"
          securityContext:
            runAsNonRoot: true
        # Main container config
        - name: foundation-web
          image: gcr.io/project-name/foundation_web:latest
          imagePullPolicy: Always
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "60"]
          env:
            # Env variables
          resources:
            requests:
              memory: "500Mi"
              cpu: "500m"
            limits:
              memory: "1000Mi"
              cpu: "1"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
            periodSeconds: 10
          ports:
            - containerPort: 4000
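One detail I'm unsure about with the preStop hook: the default terminationGracePeriodSeconds is 30 seconds, so a 60-second sleep would presumably get cut short unless the grace period is raised as well. Something like this fragment of the pod spec is what I have in mind (the 90 is an arbitrary value I picked for illustration, not something in my current config):
spec:
  template:
    spec:
      # Give the 60s preStop sleep room to finish before the kubelet force-kills the pod
      # (the default grace period is 30s).
      terminationGracePeriodSeconds: 90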
Service config:
apiVersion: v1
kind: Service
metadata:
  name: foundation-web-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/backend-config: '{"ports": {"4000": "foundation-service-config"}}'
spec:
  type: ClusterIP
  selector:
    app: foundation-web
  ports:
    - port: 4000
      targetPort: 4000
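For what it's worth, my understanding is that once the NEG controller has actually created the NEG, it writes a cloud.google.com/neg-status annotation back onto the Service, which seems like the easiest way to see when the NEG exists. It ends up looking roughly like this (the NEG name and zone below are made-up placeholders, not values from my cluster):
apiVersion: v1
kind: Service
metadata:
  name: foundation-web-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    # Added by the NEG controller after it creates the NEG (placeholder values):
    cloud.google.com/neg-status: '{"network_endpoint_groups":{"4000":"k8s1-placeholder-default-foundation-web-service-4000-placeholder"},"zones":["us-central1-a"]}'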
BackendConfig:
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: foundation-service-config
spec:
  # sessionAffinity:
  #   affinityType: "GENERATED_COOKIE"
  #   affinityCookieTtlSec: 120
  connectionDraining:
    drainingTimeoutSec: 60
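In case it's relevant: as far as I can tell from the BackendConfig docs, it can also pin the load balancer health check explicitly instead of having it inferred from the readiness probe. A rough sketch of what I believe that would look like for my /healthz endpoint (the healthCheck block is my guess, not something I'm currently running):
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: foundation-service-config
spec:
  connectionDraining:
    drainingTimeoutSec: 60
  # Sketch: point the load balancer health check at the app's /healthz endpoint directly.
  healthCheck:
    type: HTTP
    requestPath: /healthz
    port: 4000
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2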
Ingress config:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: foundation-web-ingress
  labels:
    name: foundation-web-ingress
spec:
  backend:
    serviceName: foundation-web-service
    servicePort: 4000
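(I know networking.k8s.io/v1beta1 is deprecated; it's still what 1.18 serves, but in case the API version matters, my understanding is that the networking.k8s.io/v1 equivalent once the cluster is on 1.19+ would look roughly like this:)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: foundation-web-ingress
  labels:
    name: foundation-web-ingress
spec:
  defaultBackend:
    service:
      name: foundation-web-service
      port:
        number: 4000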