
The k8s Cron Job Limitations section mentions that there is no guarantee that a job will be executed exactly once:

A cron job creates a job object about once per execution time of its schedule. We say “about” because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.

Could anyone explain:

  • why this could happen?
  • what is the probability of this happening?
  • will it be fixed in some reasonable future in k8s?
  • are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?
  • do other cron-related services suffer from the same issue? Maybe it is a core cron problem?
Keyan P
radistao

1 Answer


The controller:

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go

starts with a comment that lays the groundwork for an explanation:

I did not use watch or expectations. Those add a lot of corner cases, and we aren't expecting a large volume of jobs or scheduledJobs. (We are favoring correctness over scalability.)  

If we find a single controller thread is too slow because there are a lot of Jobs or CronJobs, we can parallelize by Namespace. If we find the load on the API server is too high, we can use a watch and UndeltaStore.

Just periodically list jobs and SJs, and then reconcile them.

Periodically means every 10 seconds:

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go#L105

The documentation following the quoted limitations also gives useful detail on some of the circumstances under which two jobs or no jobs may be launched for a particular scheduled time:

If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to AllowConcurrent, the jobs will always run at least once.

Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency. For example, suppose a cron job is set to start at exactly 08:30:00 and its startingDeadlineSeconds is set to 10, if the CronJob controller happens to be down from 08:29:00 to 08:42:00, the job will not start. Set a longer startingDeadlineSeconds if starting later is better than not starting at all.
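
The 08:30:00 example from the docs can be worked through with a small model. This is only a sketch, not the controller's actual code: the function name job_start_time and its parameters are made up for illustration, and it assumes the controller syncs every 10 seconds, that the syncs line up with the scheduled time, and that a run is started at the first sync that falls outside the outage and no later than the deadline.

```python
from datetime import datetime, timedelta


def job_start_time(scheduled, deadline_seconds, down_from, down_until, sync_period=10):
    """Return when the run would start, or None if the run is missed.

    Hypothetical, simplified model (not the real controller logic):
    the controller syncs every `sync_period` seconds, starting at
    `scheduled`, and can start the run only while it is up and only
    until scheduled + deadline_seconds.
    """
    deadline = scheduled + timedelta(seconds=deadline_seconds)
    t = scheduled
    while t <= deadline:
        controller_down = down_from <= t < down_until
        if not controller_down:
            return t
        t += timedelta(seconds=sync_period)
    return None


scheduled = datetime(2018, 2, 19, 8, 30, 0)
down_from = datetime(2018, 2, 19, 8, 29, 0)
down_until = datetime(2018, 2, 19, 8, 42, 0)

# startingDeadlineSeconds = 10: the deadline passes during the outage.
missed = job_start_time(scheduled, 10, down_from, down_until)    # None

# startingDeadlineSeconds = 900: the run starts late, once the controller is back.
late = job_start_time(scheduled, 900, down_from, down_until)     # 08:42:00
```

With the 10 second deadline the run is skipped; with a 15 minute deadline it starts 12 minutes late, which is the trade-off the docs describe.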

At a higher level, guaranteeing exactly-once execution in a distributed system is hard:

https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

Clocks and time synchronization in a distributed system are also hard:

https://8thlight.com/blog/rylan-dirksen/2013/10/04/synchronization-in-a-distributed-system.html

To the questions:

  • why this could happen?

    For instance, the node hosting the CronJob controller fails at the time a job is supposed to run.

  • what is the probability of this happening?

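    Very unlikely for any given run, but over a large enough number of runs you are very unlikely to avoid it entirely. If each run goes wrong independently with some small probability p, the chance of at least one bad run in n runs is 1 - (1 - p)^n, which approaches 1 as n grows.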

  • will it be fixed in some reasonable future in k8s?

    There are no idempotency-related issues under the area/batch label in the k8s repo, so one would guess not.

    https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Aarea%2Fbatch

  • are there any workarounds to prevent such a behavior (if the running job can't be implemented as idempotent)?

    Think carefully about what idempotent has to mean for this job, and about the particular points where the job commits its results. For instance, a job can tolerate running more than once if each run saves its output to a staging area and an election process then decides whose work wins.

  • do other cron-related services suffer from the same issue? Maybe it is a core cron problem?

    Yes, it's a core distributed systems problem.

    For most users, the k8s documentation gives perhaps a more precise and nuanced answer than is necessary. If your scheduled job is controlling some critical medical procedure, it's really important to plan for failure cases. If it's just doing some system cleanup, missing a scheduled run doesn't much matter. By definition, nearly all users of k8s CronJobs fall into the latter category.
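
As a rough illustration of the workaround idea, here is a simpler variant than the staging-and-election approach: each run first tries to claim its scheduled time under a unique key, and a duplicate that loses the claim exits without doing the work. This is a sketch under assumptions, not a recommended implementation: claim_run, the runs table, and the job name are hypothetical, in-memory SQLite stands in for whatever shared database all job pods can reach, and how a job learns its scheduled time is left open.

```python
import sqlite3


def claim_run(conn, job_name, scheduled_time):
    """Try to claim one scheduled run; return True only for the first claimant."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        " job_name TEXT NOT NULL,"
        " scheduled_time TEXT NOT NULL,"
        " PRIMARY KEY (job_name, scheduled_time))"
    )
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO runs (job_name, scheduled_time) VALUES (?, ?)",
                (job_name, scheduled_time),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists: another run claimed this slot.
        return False


# In-memory database used only for the demonstration (assumption).
conn = sqlite3.connect(":memory:")
first = claim_run(conn, "nightly-report", "2018-02-19T08:30:00Z")   # True
second = claim_run(conn, "nightly-report", "2018-02-19T08:30:00Z")  # False
```

The claim only prevents duplicate work; it does not help with a run that never started, and a claimant that crashes halfway leaves the slot claimed, so missed or partial runs still need their own alerting or retry.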

Jonah Benton
  • looks clear, thx a lot. as for "job might be not started when job controller fails" - it was quite obvious, although why it could start multiple times was harder to understand. – radistao Feb 19 '18 at 13:25
  • I keep having multiple jobs running at one cron execution point. But it seems only if those jobs have a very short runtime. Any idea why this happens and how I can prevent it? I use concurrencyPolicy: Forbid, backoffLimit: 0 and restartPolicy: Never. – Nico Apr 02 '19 at 10:03
  • We have long running nightly jobs that get duplicated sometimes. We don't really see misses often. And we can alert and check for that easily. The duplicates are a problem, however. I am working on a solution to that. – nroose Jan 27 '20 at 23:42
  • @nroose Have you figured out the solution? – khichar.anil Mar 24 '21 at 15:45
  • @khichar.anil Not really. I have a mutex done through our MySQL to make sure the dupes don't run at the same time. These have not really been happening often any more. – nroose Mar 25 '21 at 01:12
  • In the meantime the source code has moved to https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controllerv2.go. The comment is not present anymore. But still the docs mention the "almost" once warning. Rarely the job does not run. Rarely the job runs twice. https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/ – Carl in 't Veld Oct 11 '22 at 13:04