7

I was reading the Kubernetes documentation about jobs and retries. I found this:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.

I have two questions about the above quote:

  1. Is the back-off limit value expressed in minutes or as a number of retries? The documentation's use of the value 6 (six) is confusing, because it first says the value is the number of retries but then says "capped at six minutes".
  2. Is there a way to define the back-off delay time? As I understand it, this behavior (10s, 20s, 40s …) is the default and can't be changed.
Dherik

2 Answers

9

There is no ambiguity here: .spec.backoffLimit is the number of retries.

The Job controller recreates the failed Pods (associated with the Job) with an exponential back-off delay (10s, 20s, 40s, ... , 360s). This delay is set by the Job controller itself.

  • If the Pod fails, a new Pod is created after 10s
  • If it fails again, a new one is created after 20s
  • If it fails again, a new one is created after 40s
  • If it fails again, the next one is created after 80s (1m 20s)
  • If it fails again, the next one is created after 160s (2m 40s)
  • If it fails again, the next one is created after 320s (5m 20s)
  • If it fails again, the next one is created after 360s (not 640s, because the delay is capped at 360s, i.e. 6m)
Shudipta Sharma
0

By looking at the source code, it seems like the backoffLimit attribute specifies the failure count rather than failure time.

Excerpt of the code mentioned above:

func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rErr error) {
    // ...

    succeeded, failed := getStatus(&job, pods, uncounted, expectedRmFinalizers)

    // ...

    jobHasNewFailure := failed > job.Status.Failed
    exceedsBackoffLimit := jobHasNewFailure && (active != *job.Spec.Parallelism) &&
        (failed > *job.Spec.BackoffLimit)

    // ...
}
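To make the count-versus-time point concrete, here is a deliberately simplified sketch of the last comparison in the excerpt. exceedsBackoffLimit below is a hypothetical stand-in: it keeps only the `failed > BackoffLimit` part and ignores the other conditions (new failure, parallelism) that the real controller also checks.

```go
package main

import "fmt"

// exceedsBackoffLimit is a hypothetical, simplified version of the check in
// the excerpt: it compares a number of failed Pods with the limit. No
// duration is involved anywhere in the comparison.
func exceedsBackoffLimit(failed, backoffLimit int32) bool {
	return failed > backoffLimit
}

func main() {
	const backoffLimit int32 = 6 // default value according to the quoted documentation

	for failed := int32(5); failed <= 7; failed++ {
		fmt.Printf("failed=%d exceeds=%t\n", failed, exceedsBackoffLimit(failed, backoffLimit))
	}
}
```

Under this simplification, six failures do not exceed a limit of 6; the seventh does.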
jverce