
I get downtime on my app running on GKE when I deploy it using a rolling update.

rollingUpdate:
  maxSurge: 25%
  maxUnavailable: 0
type: RollingUpdate

I've checked the events on my pod and the last event is this one:

NEG is not attached to any Backend Service with health checking. Marking condition "cloud.google.com/load-balancer-neg-ready" to True.

On my pod I have a livenessProbe like this:

livenessProbe:
  failureThreshold: 1
  httpGet:
    path: /healthz
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

startupProbe:
  failureThreshold: 30
  httpGet:
    path: /healthz
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
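
I don't currently have a readinessProbe configured. In case it matters, a minimal one reusing the same /healthz endpoint (assuming that endpoint is also suitable for readiness) would look like:

```yaml
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: http
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```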

I checked my LB logs and found this:

{
  httpRequest: {
    latency: "0.002246s"
    remoteIp: "myIP"
    requestMethod: "GET"
    requestSize: "37"
    requestUrl: "https://www.myurl/"
    responseSize: "447"
    status: 502
    userAgent: "curl/7.77.0"
  }
  insertId: "1mk"
  jsonPayload: {…}
  logName: "myproject/logs/requests"
  receiveTimestamp: "2022-02-15T15:30:52.085256523Z"
  resource: {
    labels: {…}
    type: "http_load_balancer"
  }
  severity: "WARNING"
  spanId: "b75e2f583a0e9e25"
  timestamp: "2022-02-15T15:30:51.270776Z"
  trace: "myproject/traces/32c488f48a392ac42358be0f"
}

And this is my deployment spec, as requested:

spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: app
      app.kubernetes.io/name: myname
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        checksum/config: 4920135cd08336150d3184cc1af
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: app
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: webapp-server
        app.kubernetes.io/part-of: webapp
        helm.sh/chart: myapp-1.0.0
    spec:
      containers:
      - env:
        - name: ENV_VAR
          value: Hello
        envFrom:
        - configMapRef:
            name: myapp
        - secretRef:
            name: myapp-credentials
        image: imagelink
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 1
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: node
        ports:
        - containerPort: 3000
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 250m
            memory: 256Mi
        startupProbe:
          failureThreshold: 30
          httpGet:
            path: /healthz
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst

What can I change to avoid this downtime when performing a rollingUpdate?


2 Answers


This worked for me after adding a preStop lifecycle hook:

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - sleep 60

This basically gives the pod 60 seconds to handle SIGTERM and finish serving in-flight requests while the new pod comes up and handles new requests.
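
One caveat: if you use a long sleep like this, make sure the pod's terminationGracePeriodSeconds (30 seconds by default) is at least as long as the sleep plus your app's shutdown time, otherwise the kubelet will kill the container mid-drain. A sketch (the 90-second value here is an illustrative assumption, not something from the question):

```yaml
spec:
  terminationGracePeriodSeconds: 90  # must cover the 60s preStop sleep plus app shutdown
  containers:
  - name: node
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 60"]
```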

  • 1
    This is a work-around rather than a real solution to the problem. Did you ever find a real solution? I posted about the same issue here: https://github.com/kubernetes/ingress-gce/issues/1718. – Raman May 19 '22 at 17:57

For an update with zero downtime, you should consider running more than one pod.
You can also tweak your maxSurge and maxUnavailable values.
A one-second probe timeout seems a bit low; consider raising it.
Finally, you can find an extensive guide on rolling updates in the Google Cloud docs.
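
These suggestions can be sketched in the Deployment spec like this (the replica count is an illustrative value, not tested against your app):

```yaml
spec:
  replicas: 3               # more than one pod, so one can drain while others serve
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # bring up one extra pod during the update
      maxUnavailable: 0     # never remove a serving pod before its replacement is ready
```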
