Kubernetes jobs and back-off limit values: is the value a number of retries or minutes?

Question

I was reading the Kubernetes documentation about jobs and retries. I found this:

There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6. Failed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s, 20s, 40s …) capped at six minutes. The back-off count is reset if no new failed Pods appear before the Job’s next status check.

I had two questions about the above quote:

Is the back-off limit value in minutes or a number of retries? The documentation's use of the value 6 (six) is confusing, because it first states that the value is the number of retries but then says "capped at six minutes".
Is there a way to define the back-off delay time? As I understand it, this behavior (10s, 20s, 40s …) is the default and can't be changed.

score 9 · Accepted Answer · answered Aug 08 '19 at 18:05

There is no ambiguity: .spec.backoffLimit is the number of retries.

The Job controller recreates the failed Pods associated with the Job with an exponentially increasing delay (10s, 20s, 40s, ..., 360s). This delay is set by the Job controller, and the "six minutes" in the documentation is the cap on that delay, not the back-off limit.

If the Pod fails, a new Pod is created after 10s.
If it fails again, a new one is created after 20s.
If it fails again, a new one is created after 40s.
If it fails again, the next one is created after 80s (1m 20s).
If it fails again, the next one is created after 160s (2m 40s).
If it fails again, the next one is created after 320s (5m 20s).
If it fails again, the next one is created after 360s (not 640s, because that exceeds the cap of 360s, or 6m).
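
The schedule listed above can be reproduced with a short Go sketch. This is only an illustration of the arithmetic (start at 10s, double after each failure, cap at 6 minutes), not the Job controller's actual code; the function names backoffDelays and defaultDelays are hypothetical, and the 10s and 6m values are taken from the documentation quote.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the delay before each of the first n Pod
// recreations, assuming the delay starts at base, doubles after every
// failure, and never exceeds maxDelay. (Hypothetical helper.)
func backoffDelays(n int, base, maxDelay time.Duration) []time.Duration {
	delays := make([]time.Duration, 0, n)
	d := base
	for i := 0; i < n; i++ {
		if d > maxDelay {
			d = maxDelay // cap: 640s would exceed 360s
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// defaultDelays applies the values from the documentation quote:
// 10s initial delay, capped at six minutes. (Hypothetical helper.)
func defaultDelays(n int) []time.Duration {
	return backoffDelays(n, 10*time.Second, 6*time.Minute)
}

func main() {
	for i, d := range defaultDelays(7) {
		fmt.Printf("failure %d: next Pod after %v\n", i+1, d)
	}
}
```

The seventh entry is 6m0s rather than 10m40s, which is the only place where "six minutes" enters the picture.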

score 0 · Answer 2 · answered Aug 16 '22 at 23:37

By looking at the source code, it seems like the backoffLimit attribute specifies the failure count rather than failure time.

Excerpt of the code mentioned above:

func (jm *Controller) syncJob(ctx context.Context, key string) (forget bool, rErr error) {
    // ...

    succeeded, failed := getStatus(&job, pods, uncounted, expectedRmFinalizers)

    // ...

    jobHasNewFailure := failed > job.Status.Failed
    exceedsBackoffLimit := jobHasNewFailure && (active != *job.Spec.Parallelism) &&
        (failed > *job.Spec.BackoffLimit)

    // ...
}
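
To make the comparison in the excerpt concrete, here is a simplified, self-contained sketch of the same boolean expression using plain integers instead of the real Job and Pod types. The function signature and parameter names are assumptions made for illustration; only the expression itself comes from the excerpt.

```go
package main

import "fmt"

// exceedsBackoffLimit mirrors the boolean expression from the excerpt
// with plain values. previouslyFailed stands in for job.Status.Failed.
// (Simplified stand-in, not the controller's real signature.)
func exceedsBackoffLimit(failed, previouslyFailed, active, parallelism, backoffLimit int32) bool {
	jobHasNewFailure := failed > previouslyFailed
	return jobHasNewFailure && active != parallelism && failed > backoffLimit
}

func main() {
	const backoffLimit = 6 // the documented default
	// Simulate one new failure per sync with no active Pods and parallelism 1.
	for failed := int32(1); failed <= 7; failed++ {
		exceeded := exceedsBackoffLimit(failed, failed-1, 0, 1, backoffLimit)
		fmt.Printf("failed=%d exceeds=%v\n", failed, exceeded)
	}
}
```

Every operand is a count of Pods, and none is a duration. In this sketch the check only becomes true once the failure count is strictly greater than the limit.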
