
Note: I don't know why, but sometimes one pod suddenly changes its status to Unknown, and that's when the new pod starts.

I'm using Kubernetes on Google Cloud.

I built the YAML file for the CronJob that I need to run:

apiVersion: batch/v1beta1
kind: CronJob
metadata: 
  name: etl-table-feed-from-schema-vtex-to-schema-sale-all
spec:
  schedule: "* * * * *"
  concurrencyPolicy: "Forbid"
  failedJobsHistoryLimit: 3
  successfulJobsHistoryLimit: 1
  startingDeadlineSeconds: 60 # 1 min
  jobTemplate:
    spec:
      backoffLimit: 0
      #activeDeadlineSeconds: 3600 # 1 hora
      template:
        spec:
          containers:
            - name: etl-table-feed-from-schema-vtex-to-schema-sale-all
              image: (myimage)
              command: ["/bin/sh", "-c"]
              args: (mycommands)
              env:
              - name: PYTHONUNBUFFERED
                value: "1"
              envFrom:
              - secretRef:
                  name: etl-secret
          restartPolicy: Never
          nodeSelector:
            # <label-name>: <value>
            etlnode: etl-hi-cpu

I need just one pod running at a time, just one. But sometimes, and I don't know why and can't reproduce it, more than one pod runs at a time.

I've already set concurrencyPolicy to Forbid, but it seems that's not enough.

I run this on a preemptible node pool in GKE.

Two pods that ran at the same time:

[screenshot]

  • What about the Pods/Jobs status? Were they completed successfully or was there any error? What GKE version are you using? – PjoterS Jun 24 '20 at 15:00
  • Could you check that you don't have 2 identical CronJobs? – paltaa Jun 24 '20 at 15:46
  • They don't appear in the kubectl get pods output after they stop running. I can only see these pods in the audit log in GCloud. There I saw that there were two pods running at a time. There is only one CronJob. I think this is the version: 1.14.10-gke.36 – Matheus Epifanio Jun 24 '20 at 21:05
  • I have the same use case, a job running every minute. In my case it seems that when one job is marked as DeadlineExceeded, the next one starts immediately, before the previous pod is terminated... – acristu Sep 03 '20 at 07:42
  • If the job is running every minute, it's not a cron job. It's a daemon. Set it up as such. – SineSwiper Jan 26 '22 at 19:11

2 Answers


In my case the problem is that concurrencyPolicy: "Forbid" and activeDeadlineSeconds are not enough. My previous pod receives SIGTERM but keeps running for another 30 seconds (terminationGracePeriodSeconds) before it is actually killed, so I end up with two jobs running in parallel for 30 seconds.

See this question: Kubernetes Cron Job Terminate Pod before creation of next schedule; in my case this answer provides the solution: https://stackoverflow.com/a/63721120/5868044. Two options:

  1. make the pod stop immediately on SIGTERM (e.g. with a bash trap 'exit' SIGTERM), as in the sketch after this list
  2. leave a 30+ second window between your jobs by setting an activeDeadlineSeconds smaller than the schedule interval.
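
A minimal sketch of what this could look like in the jobTemplate from the question, keeping the (myimage)/(mycommands) placeholders; the 25-second deadline is an illustrative value, and TERM is the portable /bin/sh spelling of SIGTERM:

  jobTemplate:
    spec:
      backoffLimit: 0
      activeDeadlineSeconds: 25   # option 2: kill the Job well before the next 60 s tick
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: etl-table-feed-from-schema-vtex-to-schema-sale-all
              image: (myimage)
              command: ["/bin/sh", "-c"]
              # option 1: exit right away on SIGTERM instead of waiting out the
              # 30 s terminationGracePeriodSeconds
              args: ["trap 'exit 0' TERM; (mycommands) & wait"]
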
– acristu

You have set schedule: "* * * * *", which means a Job will be created every minute.

concurrencyPolicy: "Forbid" is working as described.

The cron job does not allow concurrent runs; if it is time for a new job run and the previous job run hasn't finished yet, the cron job skips the new job run

Meaning, it will not allow a new Job to be created while there is still an unfinished Job. Once the Job has finished, concurrencyPolicy will allow the next one to be created. It will not allow two unfinished Jobs to run at the same time.

As for activeDeadlineSeconds, per the Kubernetes docs:

The activeDeadlineSeconds applies to the duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds, all of its running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded.

Also, as mentioned in the Jobs cleanup policy:

If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.

To test this I used busybox with a sleep 20 command, as I don't know exactly what your job is doing.
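
A minimal sketch of such a test manifest, assuming the schedule and history limits from your CronJob and a busybox container that just sleeps for 20 seconds:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etl-table-feed-from-schema-vtex-to-schema-sale-all
spec:
  schedule: "* * * * *"
  concurrencyPolicy: "Forbid"
  failedJobsHistoryLimit: 3
  successfulJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: test
              image: busybox
              # stand-in workload so the Job completes after ~20 s
              command: ["/bin/sh", "-c", "sleep 20"]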

Meaning, if you keep your current settings:

spec:
  failedJobsHistoryLimit: 3
  successfulJobsHistoryLimit: 1

It will keep the successful Job until the next one is created, and will keep it around for a while in case you would like to check its logs etc.

$ kubectl get cronjob,job,pod
NAME                                                               SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all   * * * * *   False     1        17s             51s

NAME                                                                      COMPLETIONS   DURATION   AGE
job.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018780   0/1           14s        14s

NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018h9pnh   1/1     Running   0          13s
---
$ kubectl get cronjob,job,pod
NAME                                                               SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all   * * * * *   False     1        33s             2m7s

NAME                                                                      COMPLETIONS   DURATION   AGE
job.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018780   1/1           23s        90s
job.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018840   1/1           21s        29s

NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018h9pnh   0/1     Completed   0          89s
pod/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018k7b58   0/1     Completed   0          29s
---
$ kubectl get cronjob,job,pod
NAME                                                               SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all   * * * * *   False     0        34s             2m8s

NAME                                                                      COMPLETIONS   DURATION   AGE
job.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018840   1/1           21s        30s

NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018k7b58   0/1     Completed   0          30s

However, if you set successfulJobsHistoryLimit to 0, the Job will be removed shortly after it finishes, even before the next scheduled Job.

spec:
  failedJobsHistoryLimit: 3
  successfulJobsHistoryLimit: 0

Output:

$ kubectl get cronjob,job,pod
NAME                                                               SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all   * * * * *   False     1        18s             31s

NAME                                                                      COMPLETIONS   DURATION   AGE
job.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018540   0/1           15s        15s

NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/etl-table-feed-from-schema-vtex-to-schema-sale-all-15930182r5bn   1/1     Running   0          15s
---
$ kubectl get cronjob,job,pod
NAME                                                               SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all   * * * * *   False     1        31s             44s

NAME                                                                      COMPLETIONS   DURATION   AGE
job.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all-1593018540   1/1           22s        28s

NAME                                                                  READY   STATUS      RESTARTS   AGE
pod/etl-table-feed-from-schema-vtex-to-schema-sale-all-15930182r5bn   0/1     Completed   0          28s
---
$ kubectl get cronjob,job,pod
NAME                                                               SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/etl-table-feed-from-schema-vtex-to-schema-sale-all   * * * * *   False     0        34s             47s

How long it is kept also depends on the job duration.

Also, if the job completed successfully (exit code 0), the pod will change its status to Completed and will no longer use CPU/memory resources.

You can also read about the TTL mechanism (ttlSecondsAfterFinished), but unfortunately I don't think it would work here, as the master is managed by Google and this feature would require enabling a feature gate on the control plane.
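
For reference, if the feature were enabled it would be a single field on the Job template inside the CronJob; the 100 seconds below is just an example value:

spec:
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 100   # remove the finished Job (and its pods) ~100 s after completion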

– PjoterS