
I have a CronJob where the pod it starts ends up in ImagePullBackOff, and the CronJob never schedules another pod, even though it should per its schedule. Is there a way to force the cron controller to schedule another pod even though the previous one ended in ImagePullBackOff?

I don't want multiple pods running at the same time, so I use `concurrencyPolicy: Forbid`. Is there any way to get the CronJob to still schedule another pod?
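For reference, a minimal sketch of the kind of spec I have (the names, schedule, and image below are placeholders, not my real config):

```shell
# Placeholder reproduction of the setup described above
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid        # no overlapping runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: my-container
            image: registry.example.com/app:v1.2.3   # a bad tag here leads to ImagePullBackOff
EOF
```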

user2062360
  • You should fix the ImagePullBackOff error rather than force the cron controller to schedule another pod. – Miffa Young Aug 25 '21 at 02:32
  • Did you try [this](https://stackoverflow.com/questions/55820054/kubernetes-cronjob-stops-scheduling-jobs/55821114#55821114)? – Mikolaj S. Aug 25 '21 at 13:29
  • The advice in the linked issue is pretty generic, and it isn't clear how to apply it in this case. Saying "fix the ImagePullBackOff" is not the right answer if you happened to push a configuration with an invalid image id. But in that case, AFAICT, your cronjob is just stuck. – Leopd Jun 16 '22 at 00:22

2 Answers


You don't really want the scheduler to schedule another pod. Doing that would lead to a resource leak, as explained in Infinite ImagePullBackOff CronJob results in resource leak, which @VonC mentions in his answer.

Instead, you should focus on fixing the root cause of the ImagePullBackOff. Once that is done, Kubernetes will automatically pull the image and run the pod, and a new one will be scheduled the next time the cron schedule fires.

ImagePullBackOff means that the container could not start because its image could not be retrieved. The cause could be, for example, an invalid image id or tag, a missing or invalid imagePullSecret, or network connectivity issues.
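To narrow down which of these it is, the failing pod's events usually spell it out. A quick check (the pod and namespace names below are placeholders):

```shell
# The Events section names the exact pull failure (bad tag, auth, network)
kubectl describe pod my-cronjob-27580000-abcde

# Or list recent failure events in the namespace, newest last
kubectl get events --field-selector reason=Failed --sort-by=.lastTimestamp
```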

When a pod is in ImagePullBackOff, Kubernetes periodically retries pulling the image, and once the image is successfully pulled the pod starts.

The delay between pull attempts increases with each attempt (the "back-off"), as explained in the docs:

Kubernetes raises the delay between each attempt until it reaches a compiled-in limit, which is 300 seconds (5 minutes).
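One caveat if the bad image reference was pushed in the CronJob spec itself (a case raised in the comments below): Jobs that have already been created keep the old pod template, so fixing the CronJob alone won't unstick the existing Job. A sketch of recovering, with placeholder names and image:

```shell
# Point the CronJob's pod template at the corrected image
kubectl patch cronjob my-cronjob --type=json -p='[
  {"op": "replace",
   "path": "/spec/jobTemplate/spec/template/spec/containers/0/image",
   "value": "registry.example.com/app:v1.2.3"}
]'

# Delete the Job stuck in ImagePullBackOff; with concurrencyPolicy: Forbid
# this frees the slot so the next scheduled run can be created
kubectl delete job my-cronjob-27580000
```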

danielorn
  • I agree that my answer does not answer the question as it is currently worded, and that setting `concurrencyPolicy: Allow` will actually schedule another pod despite the `imagePullBackOff`. However, I fail to see any practical case where that helps: it will just create more pods in `imagePullBackOff` state until the underlying issue is resolved anyway, and at that point the pod that is already scheduled will start. But I might be missing something. – danielorn Jun 20 '22 at 19:09
  • While setting `concurrencyPolicy: Allow` will create pods even if the existing ones are in `imagePullBackOff` state, it means pods will run in parallel once the issue causing the `imagePullBackOff` is fixed (all created pods will start within 300 seconds of each other and thus likely overlap, depending on the number of pods and their execution time), which the question explicitly called out as undesirable. – danielorn Jun 20 '22 at 19:19
  • The issue we're seeing is that if a CronJob gets updated with an invalid imageid for whatever reason, fixing the imageid in the cronjob doesn't fix the problem. Instead of giving up on the failed job and creating a new one with the fixed imageid, it just keeps hoping the job with the invalid imageid will succeed. – Leopd Jun 22 '22 at 06:48

Using `concurrencyPolicy: Forbid` is one of the workarounds to that "feature" (rescheduling a pod after an ImagePullBackOff).

See kubernetes/kubernetes issue 76570, which illustrates a drawback of said feature:

What happened:

A CronJob without a ConcurrencyPolicy or history limit that uses an image that doesn't exist will slowly consume almost all cluster resources.
In our cluster we started hitting the pod limit on all of our nodes, and began losing our ability to schedule new pods.

What you expected to happen:

Even without a ConcurrencyPolicy, CronJob should probably have the same behavior as most of the other pod schedulers.
If I try to start a deployment with X replicas and I get ImagePullBackOff on one of the containers in a pod, the deployment won't keep trying to schedule more pods on different nodes until it consumes all cluster resources.

This is especially bad with CronJob, because unlike Deployment where an upper limit for horizontal scalability has to be set, CronJob with no history limit and ConcurrencyPolicy will slowly consume all resources on a cluster.

While this is up for debate, I would personally say that when a scheduled Job has the ImagePullBackOff error, it shouldn't keep trying to schedule new pods. It should probably kill the pod trying to pull an image and make a new one, or wait for the pod to successfully pull the image.
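If you do need `concurrencyPolicy: Forbid` but don't want a stuck job to block the schedule forever, one possible mitigation (a sketch only; names, schedule, and values are illustrative) is to give each job a deadline so it eventually fails, which terminates the stuck pod and frees the slot for the next scheduled run, and to cap the job history so failed jobs don't accumulate:

```shell
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-cronjob
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 3        # cap leftover failed Jobs (limits the leak above)
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      activeDeadlineSeconds: 240   # after 4 minutes the Job is marked failed and its pods
                                   # are terminated, so Forbid no longer blocks the next run
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: my-container
            image: registry.example.com/app:v1.2.3
EOF
```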

VonC