I have pods working on jobs (processing files). Suppose one of the pods gets stuck or crashes. Is there a way to find out which file the pod was working on, save the data it had processed up to the crash, kill the pod, and assign the job to a new pod? We already know which pod has crashed. What are the ways to achieve this?

Sami Ullah

1 Answer

Kubernetes usually does not take care of workload resumability and persistence; that is the developer's responsibility.

For example, if you would like your job to be resumable, then before starting any work on a file you should save the name of the file in some external database. This way, if your job fails while processing a single file, you know which file it started working on but did not finish, and you can redo that file in another Pod.
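A minimal sketch of that idea, using Python's built-in `sqlite3` module as a stand-in for the external database. The table name `in_progress` and the helpers `open_tracking_db`, `tracked`, and `unfinished_files` are hypothetical names chosen for illustration; a real deployment would use a database that outlives the Pod.

```python
import sqlite3
from contextlib import contextmanager


def open_tracking_db(path=":memory:"):
    """Open the tracking database and create the table if needed.

    ":memory:" is only for demonstration; in practice this must be an
    external database that survives the Pod.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS in_progress ("
        "file_name TEXT PRIMARY KEY, pod_name TEXT NOT NULL)"
    )
    conn.commit()
    return conn


@contextmanager
def tracked(conn, file_name, pod_name):
    """Record the file before work starts; remove the record on success."""
    conn.execute(
        "INSERT INTO in_progress (file_name, pod_name) VALUES (?, ?)",
        (file_name, pod_name),
    )
    conn.commit()  # committed before any processing happens
    yield
    # Reached only if the body finished without raising. If the body
    # raises, or the process dies, the row stays behind as a marker.
    conn.execute("DELETE FROM in_progress WHERE file_name = ?", (file_name,))
    conn.commit()


def unfinished_files(conn):
    """Return the files that were started but never finished."""
    rows = conn.execute("SELECT file_name FROM in_progress ORDER BY file_name")
    return [row[0] for row in rows]
```

Wrapping the processing of each file in `with tracked(conn, file_name, pod_name): ...` means a crash in the middle of a file leaves its name in the table, where another Pod can find it later.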

An example workflow may look as follows:

  • Start processing
  • Check the database for any files whose processing has been started but whose Pod is no longer there
  • If there are such files, take them and enter your Pod name into the table as the current assignee working on each file
  • After processing all such files, proceed with regular files
  • Start working on a regular file
  • Write the name of the file and the name of the Pod currently processing it into the database
  • If processing succeeds, delete the entry from the database
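The steps above could be sketched roughly as follows, again with `sqlite3` standing in for the external database. All names here (`in_progress`, `claim_orphans`, `run_worker`) are hypothetical, and the set of live Pod names is assumed to be supplied by the caller; how to obtain it from the cluster is outside this sketch.

```python
import sqlite3


def open_tracking_db(path=":memory:"):
    """Create the tracking table (":memory:" is for demonstration only)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS in_progress ("
        "file_name TEXT PRIMARY KEY, pod_name TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def claim_orphans(conn, my_pod, live_pods):
    """Take over files whose recorded Pod is no longer alive."""
    rows = conn.execute(
        "SELECT file_name, pod_name FROM in_progress ORDER BY file_name"
    ).fetchall()
    orphans = [name for name, pod in rows if pod not in live_pods]
    for name in orphans:
        # Enter this Pod as the current assignee of the orphaned file.
        conn.execute(
            "UPDATE in_progress SET pod_name = ? WHERE file_name = ?",
            (my_pod, name),
        )
    conn.commit()
    return orphans


def finish(conn, file_name):
    """Delete the entry once the file has been processed successfully."""
    conn.execute("DELETE FROM in_progress WHERE file_name = ?", (file_name,))
    conn.commit()


def run_worker(conn, my_pod, live_pods, regular_files, process):
    # First redo the files left behind by Pods that are gone.
    for name in claim_orphans(conn, my_pod, live_pods):
        process(name)
        finish(conn, name)
    # Then proceed with the regular files.
    for name in regular_files:
        cur = conn.execute(
            "INSERT OR IGNORE INTO in_progress (file_name, pod_name) "
            "VALUES (?, ?)",
            (name, my_pod),
        )
        conn.commit()
        if cur.rowcount == 0:
            continue  # already recorded by another Pod, so skip it
        process(name)
        finish(conn, name)
```

This sketch does not guard against two Pods claiming the same orphaned file at the same moment; with a shared database, the claim would need to be done in a transaction or as a conditional update.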

This is a cross-cutting, framework-level concern, and such resume-and-check steps are usually handled by frameworks. If you are working with Java, I would recommend the Spring Batch framework.

TheCoolDrop