
I am using Helm charts for deployments on Google Kubernetes Engine, with a rolling update strategy.

Currently I am running 10 pods. When I deploy with a rolling update, I expect a new pod to come up, traffic to stop being routed to the old pod, and only then the old pod to be taken down gracefully, and so on for the remaining pods.

But in my case, when a new pod is created, the old pod goes down immediately and I start getting Internal Server Error (500) responses for requests that were being served by that pod.

How can I avoid this?

      livenessProbe:
        httpGet:
          path: /health
          port: 4000
        initialDelaySeconds: 1
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health
          port: 4000
        initialDelaySeconds: 1
        periodSeconds: 10
  • Do you have liveness and readiness probes configured in your deployment? – Nicolas Pepinster Mar 26 '20 at 07:44
  • Yes, the liveness and readiness probes are configured in my deployment. – Ankush Bansal Mar 26 '20 at 07:54
  • What are the settings for the probes and for the update strategy? – Nick Mar 26 '20 at 12:12
  • `livenessProbe:` `httpGet:` `path: /health` `port: 4000` `initialDelaySeconds: 1` `periodSeconds: 10` `readinessProbe:` `httpGet:` `path: /health` `port: 4000` `initialDelaySeconds: 1` `periodSeconds: 10` – Ankush Bansal Mar 26 '20 at 13:34
  • This is more about how the application handles graceful shutdown. Are you having any issues with the new pods? – Patrick W Mar 26 '20 at 16:19
  • Have you tried increasing initialDelaySeconds? 1 second might be too short, so the container may not have started completely. – Nick Mar 26 '20 at 18:44
  • @PatrickW I am having the problem with the old pods. – Ankush Bansal Mar 30 '20 at 04:03
  • @Nick 1 second could have been too short if I had a problem with the new pods, but my problem is with the old pods going down. Still, I will try increasing initialDelaySeconds. – Ankush Bansal Mar 30 '20 at 04:03
  • If I got you right, the issue is that the old pods stop serving traffic and return error 500, which means that somehow a request arrived at an old pod after it had already been shut down. That is why I asked you to add to the question: 1) the update strategy settings; 2) the readiness/liveness probe settings. – Nick Mar 30 '20 at 06:18

1 Answer


It sounds like you need to tweak your rolling update strategy. You can find similar discussions here and here about performing rolling updates without errors.

The update strategy is important because it defines how many pods may be unavailable during the update. For no downtime, you may want to set maxUnavailable to 0 and configure a reasonable maxSurge value.
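For example, in the Deployment spec (a minimal sketch; the maxSurge value is illustrative and should be tuned to your spare capacity):

    spec:
      replicas: 10
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0   # never take an old pod down before a replacement is Ready
          maxSurge: 2         # allow up to 2 extra pods during the rollout

With maxUnavailable set to 0, the controller only starts terminating an old pod after a surged replacement reports Ready.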

The next step is to make sure that you have appropriate readiness probes configured. Once the new pod is marked as ready, the controller will attempt to remove one (or more) of the old pods. The old pod will receive a SIGTERM and handle it however it is configured to. This means:

A) Make sure the readinessProbe only marks a pod as ready once it is fully able to accept traffic (/health may respond even though the application is not yet ready to serve; make sure this is not the case).
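For example (a sketch only; /ready is a hypothetical endpoint that returns 200 only once the application has finished initialising and can actually serve requests):

    readinessProbe:
      httpGet:
        path: /ready            # hypothetical endpoint reflecting real readiness, not just process liveness
        port: 4000
      initialDelaySeconds: 5    # give the container a moment before the first check
      periodSeconds: 10
      failureThreshold: 3       # mark the pod unready after 3 consecutive failures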

B) Your old pods need to handle SIGTERM properly and gracefully; this is done at the application layer. Keep in mind that, by default, the pod gets a 30-second grace period (terminationGracePeriodSeconds) to shut down after termination begins, and is force-killed once it expires.
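If changing the application is difficult, one common pattern (a sketch only, assuming the container image ships a sleep binary; the container name and durations are illustrative) is to delay the SIGTERM with a preStop hook and raise the grace period, so endpoint removal propagates and in-flight requests drain before the process is signalled:

    spec:
      terminationGracePeriodSeconds: 60     # default is 30s; must cover the preStop hook plus shutdown time
      containers:
        - name: app                         # hypothetical container name
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "15"]    # SIGTERM is only sent after this hook finishes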

Patrick W
  • I will try adding terminationGracePeriodSeconds and also try handling SIGTERM, and will update if this makes any difference. – Ankush Bansal Mar 30 '20 at 04:04
  • Since this is mostly affecting pods as they shut down, you also want to make sure that the application gracefully terminates rather than abruptly cutting connections – Patrick W Mar 30 '20 at 04:06
  • I have tried increasing terminationGracePeriodSeconds, livenessProbe initialDelaySeconds, and readinessProbe initialDelaySeconds, and adding a preStop hook, but none of them made any difference. The last thing remaining is terminating the application gracefully, which is tricky in my case, but I will try to handle it. – Ankush Bansal Apr 02 '20 at 04:51
  • I was able to solve this by capturing SIGTERM and adding a sleep, which stopped new requests from reaching the old pod and allowed in-flight requests to complete. – Ankush Bansal Aug 19 '20 at 08:09