
My goal is to do zero-downtime updates on Kubernetes.

But there is a problem related to file uploads.

The situation: when a user uploads a file, the webserver stores it first, and then the WAS saves the file's metadata to the DB.

The problem is that when we update the webservers, a webserver does not wait for in-flight requests to finish, so file uploads/downloads fail if the client is connected to a webserver that is about to shut down.

What am I supposed to do about this?

  • Here is a nice article explaining how to do it: https://learnk8s.io/graceful-shutdown Is this what you are looking for? – Matt Mar 15 '21 at 11:48
  • I think your process should run as PID 1 inside the container, since that is the special PID which receives `SIGTERM`. You can get into the container in a NON-production environment and run `kill -15 1`. If the behaviour is not acceptable, you can improve the app to close operations gracefully, or you can increase `terminationGracePeriodSeconds` in the `Deployment` or `StatefulSet` to the longest expected operation, in seconds. Once you can handle connections gracefully using these methods, your app is ready for pod deletions, aka zero downtime. Is this what you need? – laimison Mar 15 '21 at 14:32
  • @laimison The webserver I am using can't shut down gracefully. It just waits for the time that I configure. And we don't know when a file upload (a user request) is going to finish, so we can't set a waiting time. What I need is an architecture for this kind of service. What I am wondering is how to update the webserver when there might be user file upload/download requests in progress. – ABCD1133 Mar 16 '21 at 04:48
  • As far as I know, when you scale down (or delete a pod), new connections are no longer routed to it, so the question is how to deal with outstanding connections. I believe that is really on the app side. Kubernetes doesn't have some magic tool apart from the methods mentioned. – laimison Mar 16 '21 at 10:25
  • You can even set `terminationGracePeriodSeconds` to a few hours or more so these long connections can finish, but I believe the recommended approach is to upload files in chunks. It also helps with client-side interruptions and user satisfaction. In the frontend, that could be a tool equivalent to http://resumablejs.com . Retry is also an important factor in every HA scenario: if something is terminated inside the app, it should retry multiple times. In this particular case, the retry will hit a new pod to continue the chunks and release the old pod. – laimison Mar 16 '21 at 14:54
  • @ABCD1133 I have added an answer. Please give me feedback on whether it helped by voting for it, marking it as answered, or continuing the conversation. Cheers – laimison Mar 19 '21 at 01:02

1 Answer


In short, there is no magic tool in Kubernetes that can solve it for any type of application.

What is the main goal?

  • Being able to delete a pod gracefully (if you can do that, it is a big step)

  • The app supports both versions at the same time (for rollout and rollback)

So how to achieve zero downtime deployments and updates?

Kubernetes/Docker:

  • The application runs as the special PID 1 in the container, so it receives the SIGTERM (standard graceful shutdown) signal directly

  • You specify terminationGracePeriodSeconds in the StatefulSet or Deployment. When you scale down an application (or delete a pod to replace it with a new pod), no new connections are routed to the pod, Kubernetes sends SIGTERM to the app and waits up to terminationGracePeriodSeconds before killing it with SIGKILL. Usually up to 5-10 minutes is used to drain connections, but it could even be hours to finish long ones. If the app handles SIGTERM properly, it can finish earlier than that.

  • A working readinessProbe check
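As a rough illustration of the settings above, here is a minimal sketch that builds a Deployment manifest as a Python dict and prints it as JSON (which `kubectl apply -f` accepts). The name `upload-web`, the image, port 8080, the `/healthz` path, the 600 second grace period and the rolling update values are all assumptions made up for the example, not values from the question.

```python
import json


def build_deployment(name, image, grace_seconds):
    """Return a Deployment manifest as a plain dict (all values are illustrative)."""
    container = {
        "name": name,
        "image": image,
        "ports": [{"containerPort": 8080}],
        # The pod only receives traffic while this probe succeeds.
        "readinessProbe": {
            "httpGet": {"path": "/healthz", "port": 8080},
            "periodSeconds": 5,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 2,
            "selector": {"matchLabels": {"app": name}},
            # Assumption: start a new pod before an old one is taken away.
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Time between SIGTERM and SIGKILL for a terminating pod.
                    "terminationGracePeriodSeconds": grace_seconds,
                    "containers": [container],
                },
            },
        },
    }


manifest = build_deployment("upload-web", "example.com/upload-web:2.0", grace_seconds=600)
print(json.dumps(manifest, indent=2))
```

The grace period is only an upper bound, so the value should be picked from the longest operation you expect to drain.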

Application:

  • Ideally, the app understands SIGTERM and closes operations gracefully

  • The app should be able to retry a connection or operation if it failed on the first attempt (e.g. an API call from frontend to backend, a DB query from the backend, etc.) - this lets the operation be retried on a new pod, and in general retry is a good thing in highly available systems

  • The app does its job in smaller chunks, combined with the retry mentioned above, for example using http://resumablejs.com to avoid long file transfers, i.e. long connections

  • A strategy for schema changes: both versions should be supported at the same time (for instance, if you add a new column to the DB, it's better to add it in the first release and start using it in the second release; there are many other techniques like this)
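To show how the SIGTERM, retry and chunking points fit together, here is a small in-memory simulation, not real server code. The `Pod` class, `upload_in_chunks`, `pick_pod` and the shared `storage` dict are hypothetical names invented for this sketch, and it assumes all pods write to the same shared storage. The old pod starts draining in the middle of an upload, the failed chunk is retried, and the upload finishes on the new pod.

```python
import signal


class Pod:
    """Hypothetical stand-in for one webserver pod writing to shared storage."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage  # shared dict: upload_id -> {chunk_index: bytes}
        self.draining = False

    def on_sigterm(self, signum=signal.SIGTERM, frame=None):
        # Has the signature of a signal handler; a real server could register it
        # with signal.signal(signal.SIGTERM, pod.on_sigterm).
        self.draining = True

    def put_chunk(self, upload_id, index, data):
        if self.draining:
            raise ConnectionError(f"{self.name} is shutting down")
        # Storing by index makes a repeated chunk harmless (idempotent).
        self.storage.setdefault(upload_id, {})[index] = data


def upload_in_chunks(data, chunk_size, pick_pod, upload_id, max_attempts=3):
    """Send data chunk by chunk, retrying each chunk on whatever pod pick_pod returns."""
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        for _ in range(max_attempts):
            try:
                pick_pod().put_chunk(upload_id, index, chunk)
                break
            except ConnectionError:
                continue  # try again; routing may now pick a different pod
        else:
            raise RuntimeError(f"chunk {index} failed after {max_attempts} attempts")


def assemble(storage, upload_id):
    """Join the stored chunks back together in index order."""
    chunks = storage[upload_id]
    return b"".join(chunks[i] for i in sorted(chunks))


storage = {}
old_pod, new_pod = Pod("web-v1", storage), Pod("web-v2", storage)
calls = {"n": 0}


def pick_pod():
    # Simulated rollout: the third request still reaches the old pod just as
    # it receives SIGTERM; every later request is routed to the new pod.
    calls["n"] += 1
    if calls["n"] < 3:
        return old_pod
    if calls["n"] == 3:
        old_pod.on_sigterm()
        return old_pod
    return new_pod


data = b"0123456789" * 3
upload_in_chunks(data, chunk_size=8, pick_pod=pick_pod, upload_id="file-1")
print(assemble(storage, "file-1") == data)
```

Because each chunk is short, the old pod only has to outlive one small request instead of the whole file transfer, which is why a modest terminationGracePeriodSeconds becomes workable.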

Application (last resort):

  • Some companies that cannot afford downtime, but for which deploying a new version this way is too complicated or not possible, decide to queue new connections (with additional code) while the app is upgrading. Files and DB records are then imported from the old version into the new one to finish the zero-downtime deployment.
laimison