
I have to process tasks stored in a work queue and I am launching this kind of Job to do it:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 10
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["some",  "long", "command"]
      restartPolicy: Never
  backoffLimit: 0

The problem is that if one of the Pods managed by the Job fails, the Job terminates all the other Pods before they can complete. I would like the Job to be marked as failed, but I do not want its remaining Pods to be terminated. They should keep running and finish processing the items they have already picked from the queue.

Is there a way to do that please?

Wytrzymały Wiktor
Fabrice Jammes
  • Jobs are meant to run pods to completion. What are you trying to achieve by trying to keep the pods running? – rock'n rolla Sep 21 '21 at 10:28
  • You can run a more complex command as the container command and sleep for a long time in that command. E.g. `sh -c "some long command || sleep 1000"` – therealak12 Sep 21 '21 at 10:31
  • @rock'nrolla, let's say I have to process a queue of 10 items. Each Pod will process one item. Currently, if one Pod fails, the other Pods are terminated and won't process their items. What I want is for them to continue processing their items to completion. – Fabrice Jammes Sep 21 '21 at 10:55
  • What happens if you set `restartPolicy: OnFailure`? – gohm'c Sep 21 '21 at 10:57
  • Can you run the workers via a Deployment instead; and have a set of long-running workers, instead of trying to launch a Kubernetes Job per task? – David Maze Sep 21 '21 at 11:24

1 Answer


As already mentioned in the comments, you can set restartPolicy: OnFailure, which means the kubelet will restart a failed container in place until it succeeds. Because the Pod itself is restarted rather than replaced, such a restart does not increment the Job's count of failed Pods. To avoid an endless loop of failing restarts, you can also set activeDeadlineSeconds to bound the total run time of the Job.
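As a minimal sketch of that suggestion, the manifest from the question could be adapted as follows. The deadline of 3600 seconds is an illustrative assumption, not a recommended value, and `parallelism` is placed under the Job's `spec`, where the Job API expects it; `backoffLimit` is left out here, so its default applies.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  parallelism: 10
  # Upper bound on the Job's total run time, in seconds (illustrative value).
  activeDeadlineSeconds: 3600
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["some", "long", "command"]
      # The kubelet restarts a failed container in place instead of failing the Pod.
      restartPolicy: OnFailure
```

Once the deadline is reached, the Job terminates its running Pods and is marked as failed with reason DeadlineExceeded, so the value should be comfortably longer than the time needed to process the slowest item.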

Bazhikov