Pod crashed: find which job it was doing, and how to assign the same job to another pod

Question

I have pods working on jobs (each job does some processing on some files). Suppose one of the pods gets stuck or crashes. Is there a way to find out which file the pod was working on, save the data it had processed up to the point of the crash, kill the pod, and assign the job to a new pod? We already know which pod has crashed. How can this be achieved?

TheCoolDrop · Answer 1 · 2021-12-20T07:05:41.567

Workload resumability and workload persistence are usually not taken care of by Kubernetes; they are the responsibility of the developer.

For example, if you would like your job to be resumable, then before starting any work on a file you should save the name of the file to some external database. This way, if your job fails while processing a single file, you know which file it started working on but did not finish, and you can redo this file in some other Pod.

An example workflow may look as follows:

Start processing
Check in the database whether there are any files whose processing has been started but whose Pod is no longer there
If there are such files, take them and enter the Pod name into the table as the current assignee working on each file
After processing all such files, proceed with regular files
Start working on a regular file
Write into the database the name of the file and the name of the Pod currently doing the processing
If successful, delete the entry from the database
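
The workflow above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: it uses Python's built-in `sqlite3` as a stand-in for the external database, and the table name `in_progress`, the helper names, and the `live_pods` argument (the set of Pod names currently alive, which you would obtain yourself, for example from the Kubernetes API) are all assumptions made for the example.

```python
import sqlite3


def init_db(conn):
    # Hypothetical table: one row per file that is currently being processed.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS in_progress ("
        " file_name TEXT PRIMARY KEY,"
        " pod_name TEXT NOT NULL)"
    )
    conn.commit()


def claim_orphaned(conn, my_pod, live_pods):
    """Take over files whose recorded Pod is no longer alive."""
    rows = conn.execute(
        "SELECT file_name, pod_name FROM in_progress ORDER BY file_name"
    ).fetchall()
    orphaned = [name for name, pod in rows if pod not in live_pods]
    for name in orphaned:
        # Record this Pod as the current assignee of the orphaned file.
        conn.execute(
            "UPDATE in_progress SET pod_name = ? WHERE file_name = ?",
            (my_pod, name),
        )
    conn.commit()
    return orphaned


def start_file(conn, my_pod, file_name):
    # Written BEFORE any work starts, so a crash leaves a trace behind.
    conn.execute(
        "INSERT INTO in_progress (file_name, pod_name) VALUES (?, ?)",
        (file_name, my_pod),
    )
    conn.commit()


def finish_file(conn, file_name):
    # Only reached on success; a crashed Pod never deletes its row.
    conn.execute("DELETE FROM in_progress WHERE file_name = ?", (file_name,))
    conn.commit()


def run_worker(conn, my_pod, live_pods, regular_files, process):
    # First redo files left behind by Pods that are gone.
    for name in claim_orphaned(conn, my_pod, live_pods):
        process(name)
        finish_file(conn, name)
    # Then proceed with regular files.
    for name in regular_files:
        start_file(conn, my_pod, name)
        process(name)
        finish_file(conn, name)
```

With several workers sharing one real database, the claim step would also have to be made atomic (for example with a transaction or a conditional update) so that two Pods do not take over the same file; the sketch leaves that out.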

Now this is a very cross-cutting, framework-level concern, and such resume-and-check steps are usually handled by frameworks. If you are working with Java, I would recommend the Spring Batch framework.

Although we can get the name of the file, how can we save the data it has processed until it failed or got stuck? — Sami Ullah, Dec 20 '21 at 07:03
You have to save it to some persistent solution, like a database or shared mounted volume. — TheCoolDrop, Dec 20 '21 at 07:06
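
As a minimal sketch of the "shared mounted volume" option, the worker can write its output and a small checkpoint file to the persistent volume as it goes, so that a replacement Pod can continue from the last checkpoint. Everything here is an assumption for illustration: the file is treated as a list of records, and `process_records`, `handle`, the JSON checkpoint layout, and the `every` interval are hypothetical names and choices, not part of any Kubernetes API.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved progress, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"done": 0, "offset": 0}


def save_checkpoint(path, state):
    # Write to a temporary file and rename it, so a crash in the middle of
    # saving never leaves a half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)


def process_records(records, checkpoint_path, output_path, handle, every=100):
    state = load_checkpoint(checkpoint_path)
    # Open without truncating an existing output file; create it if missing.
    mode = "r+b" if os.path.exists(output_path) else "w+b"
    with open(output_path, mode) as out:
        # Drop any output written after the last checkpoint, then continue.
        out.truncate(state["offset"])
        out.seek(state["offset"])
        for i in range(state["done"], len(records)):
            out.write((handle(records[i]) + "\n").encode("utf-8"))
            if (i + 1) % every == 0 or i + 1 == len(records):
                # Make the output durable before recording the progress.
                out.flush()
                os.fsync(out.fileno())
                save_checkpoint(
                    checkpoint_path, {"done": i + 1, "offset": out.tell()}
                )
```

Both paths would point into the mounted volume. A smaller `every` loses less work on a crash but costs more disk syncs.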
