
While running a job on a Kubernetes cluster on GKE, I've noticed the following behavior (this is the output of kubectl get jobs --watch):

NAME                                         COMPLETIONS   DURATION   AGE
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h4m       4h4m
f76b2146-c302-4d0e-94a7-9299675bdc5b         3/8           4h10m      4h10m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h11m      4h11m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h11m      4h11m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h18m      4h18m
f76b2146-c302-4d0e-94a7-9299675bdc5b         3/8           4h18m      4h18m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h18m      4h18m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h19m      4h19m
f76b2146-c302-4d0e-94a7-9299675bdc5b         0/8           4h21m      4h21m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h22m      4h22m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h22m      4h22m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h22m      4h22m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h23m      4h23m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h29m      4h29m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h30m      4h30m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h30m      4h30m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h31m      4h31m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h31m      4h31m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h35m      4h35m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h36m      4h36m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h38m      4h38m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h38m      4h38m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h40m      4h40m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h40m      4h40m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h43m      4h43m
f76b2146-c302-4d0e-94a7-9299675bdc5b         3/8           4h46m      4h46m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h47m      4h47m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h47m      4h47m
f76b2146-c302-4d0e-94a7-9299675bdc5b         0/8           4h49m      4h49m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h52m      4h52m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h56m      4h56m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h56m      4h56m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           4h58m      4h58m
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           4h58m      4h58m
f76b2146-c302-4d0e-94a7-9299675bdc5b         2/8           5h         5h
f76b2146-c302-4d0e-94a7-9299675bdc5b         1/8           5h         5h

As you can see, the value of the "succeeded" field goes both up and down. This can also be seen in the full YAML returned by the API (showing only the status field):

status:
  active: 7
  startTime: "2021-02-02T06:42:53Z"
  succeeded: 1
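
To track this particular counter in isolation, one can poll the status field directly. This is a minimal sketch, assuming the job name from the spec below and an arbitrary 10-second interval:

# Print .status.succeeded roughly every 10 seconds
while true; do
  kubectl get job f76b2146-c302-4d0e-94a7-9299675bdc5b \
    -o jsonpath='{.status.succeeded}{"\n"}'
  sleep 10
done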

After carefully reading the documentation, I have found no reference to such behavior. Here's the full job spec:

apiVersion: batch/v1
kind: Job
metadata:
  creationTimestamp: "2021-02-02T06:42:53Z"
  labels:
    controller-uid: 575af5b7-0c98-4470-a75f-d552810a2887
    job-name: f76b2146-c302-4d0e-94a7-9299675bdc5b
  name: f76b2146-c302-4d0e-94a7-9299675bdc5b
  namespace: default
  resourceVersion: "204488525"
  selfLink: /apis/batch/v1/namespaces/default/jobs/f76b2146-c302-4d0e-94a7-9299675bdc5b
  uid: 575af5b7-0c98-4470-a75f-d552810a2887
spec:
  backoffLimit: 6
  completions: 8
  parallelism: 10
  selector:
    matchLabels:
      controller-uid: 575af5b7-0c98-4470-a75f-d552810a2887
  template:
    metadata:
      creationTimestamp: null
      labels:
        controller-uid: 575af5b7-0c98-4470-a75f-d552810a2887
        job-name: f76b2146-c302-4d0e-94a7-9299675bdc5b
    spec:
      containers:
      - command:
        - node
        - --max-old-space-size=24576
        - dist/src/main.js
        - --queue
        - f76b2146-c302-4d0e-94a7-9299675bdc5b
        env:
        - name: CONSUME_WAIT_FOR_MESSAGE_TIMEOUT_SEC
          value: "300"
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /var/secrets/google/key.json
        - name: RUN_ENVIRONMENT
          value: development
        image: gcr.io/ds-research-1/scan-worker:8def43ab927aa4d23ad0e8dafac4eb16180dae66
        imagePullPolicy: IfNotPresent
        name: scan-container
        resources:
          limits:
            cpu: 2500m
            memory: 2Gi
          requests:
            cpu: 1600m
            memory: 1Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/secrets/google
          name: google-cloud-key
      dnsPolicy: ClusterFirst
      nodeSelector:
        cloud.google.com/gke-nodepool: scans-pool
      restartPolicy: OnFailure
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - name: google-cloud-key
        secret:
          defaultMode: 420
          secretName: scan-service-app-poc-df0e7546cca02ed6fa492f01ea166160f2190e94
  • What does kubectl describe pod output? Are you sure the job exits with code 0? – paltaa Feb 02 '21 at 12:45
  • In this YAML file, each pod requests 1Gi of cluster memory, and parallelism is set to 10, so 10 pods are created at the same time. You can check the job status with `kubectl describe job "JOBNAME"`; you may find out why the job fails and can't complete 8 times. – William Feb 02 '21 at 17:51
  • @William according to the [docs](https://kubernetes.io/docs/concepts/workloads/controllers/job/): `For fixed completion count Jobs, the actual number of pods running in parallel will not exceed the number of remaining completions. Higher values of .spec.parallelism are effectively ignored.` So I think that only 8 will be created, as expected. – DorHugi Feb 03 '21 at 08:18
  • @paltaa I am certain that they exit with exit code 0; otherwise, we should have seen "backOffLimitExceeded" due to multiple failures. What do you mean by `kubectl describe pod outputs`? – DorHugi Feb 03 '21 at 08:20
  • @DorHugi, you are right, and you will find the status of the pod or job, and the error message if it failed, by running `kubectl describe pod "PODNAME"` or `kubectl describe job "JOBNAME"`. – William Feb 03 '21 at 15:15
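
Following the commenters' suggestions, a minimal sketch of the inspection commands; the job-name label value is taken from the spec above, and the pod name is a placeholder to fill in per run:

# Inspect the job and the pods it owns
kubectl describe job f76b2146-c302-4d0e-94a7-9299675bdc5b
kubectl get pods -l job-name=f76b2146-c302-4d0e-94a7-9299675bdc5b -o wide
# Then describe and check the logs of any suspicious pod
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous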
