
This is a deep dive into this question. I have a scheduled cron job and a never-ending container in the same pod. To end the never-ending container when the cron job has done its work, I'm using a liveness probe.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-failed
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300
      activeDeadlineSeconds: 300
      backoffLimit: 4
      template:
        spec:
          containers:
          - name: docker-http-server
            image: katacoda/docker-http-server:latest
            ports:
            - containerPort: 80
            volumeMounts:
            - mountPath: /cache
              name: cache-volume
            livenessProbe:
              exec:
                command:
                - sh
                - -c
                - if test -f "/cache/stop"; then exit 1; fi;
              initialDelaySeconds: 5
              periodSeconds: 5
          - name: busy
            image: busybox
            imagePullPolicy: IfNotPresent
            command:
            - sh
            - -c
            args:
            - echo start > /cache/start; sleep 15; echo stop >  /cache/stop; 
            volumeMounts:
            - mountPath: /cache
              name: cache-volume
          restartPolicy: Never
          volumes:
          - name: cache-volume
            emptyDir:
              sizeLimit: 10Mi

As you can see, the cron job writes the /cache/stop file and the never-ending container is stopped. The problem is that with some images the never-ending container exits with a failure. Is there a way to make every container stop with success?

Name:                     pod-failed-27827190
Namespace:                default
Selector:                 controller-uid=608efa7c-53cf-4978-9136-9fec772c1c6d
Labels:                   controller-uid=608efa7c-53cf-4978-9136-9fec772c1c6d
                          job-name=pod-failed-27827190
Annotations:              batch.kubernetes.io/job-tracking: 
Controlled By:            CronJob/pod-failed
Parallelism:              1
Completions:              1
Completion Mode:          NonIndexed
Start Time:               Mon, 28 Nov 2022 11:30:00 +0100
Active Deadline Seconds:  300s
Pods Statuses:            0 Active (0 Ready) / 0 Succeeded / 5 Failed
Pod Template:
  Labels:  controller-uid=608efa7c-53cf-4978-9136-9fec772c1c6d
           job-name=pod-failed-27827190
  Containers:
   docker-http-server:
    Image:        katacoda/docker-http-server:latest
    Port:         80/TCP
    Host Port:    0/TCP
    Liveness:     exec [sh -c if test -f "/cache/stop"; then exit 1; fi;] delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /cache from cache-volume (rw)
   busy:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
    Args:
      echo start > /cache/start; sleep 15; echo stop >  /cache/stop;
    Environment:  <none>
    Mounts:
      /cache from cache-volume (rw)
  Volumes:
   cache-volume:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  10Mi
Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      2m5s  job-controller  Created pod: pod-failed-27827190-8tqxk
  Normal   SuccessfulCreate      102s  job-controller  Created pod: pod-failed-27827190-4gj2s
  Normal   SuccessfulCreate      79s   job-controller  Created pod: pod-failed-27827190-5wgfg
  Normal   SuccessfulCreate      56s   job-controller  Created pod: pod-failed-27827190-lzv8k
  Normal   SuccessfulCreate      33s   job-controller  Created pod: pod-failed-27827190-fr8v5
  Warning  BackoffLimitExceeded  9s    job-controller  Job has reached the specified backoff limit

As you can see, the image katacoda/docker-http-server:latest fails when the liveness probe kills it. This doesn't happen with nginx, for example.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pod-failed
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Replace
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300
      activeDeadlineSeconds: 300
      backoffLimit: 4
      template:
        spec:
          containers:
          - name: nginx
            image: nginx
            ports:
            - containerPort: 80
            volumeMounts:
            - mountPath: /cache
              name: cache-volume
            livenessProbe:
              exec:
                command:
                - sh
                - -c
                - if test -f "/cache/stop"; then exit 1; fi;
              initialDelaySeconds: 5
              periodSeconds: 5
          - name: busy
            image: busybox
            imagePullPolicy: IfNotPresent
            command:
            - sh
            - -c
            args:
            - echo start > /cache/start; sleep 15; echo stop >  /cache/stop; 
            volumeMounts:
            - mountPath: /cache
              name: cache-volume
          restartPolicy: Never
          volumes:
          - name: cache-volume
            emptyDir:
              sizeLimit: 10Mi

Of course, the never-ending image that I'm actually pulling ends in failure, and I have no control over that image. Is there a way to force a success status for the job/pod?

Pp88
  • You might want to set terminationGracePeriodSeconds:, see [Kubernetes best practices: terminating with grace](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-terminating-with-grace?hl=en) – Sascha Doerdelmann Nov 28 '22 at 13:34
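
The terminationGracePeriodSeconds suggested in the comment above is a field on the pod spec itself, a sibling of containers and restartPolicy. A minimal sketch of where it would go; the 30-second value is only a placeholder, not something taken from this setup:

  template:
    spec:
      terminationGracePeriodSeconds: 30  # how long Kubernetes waits after SIGTERM before sending SIGKILL
      restartPolicy: Never
      containers:
      - name: docker-http-server
        image: katacoda/docker-http-server:latest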

1 Answer


It depends on the exit code of the container's main process. Every container receives a TERM signal when Kubernetes wants to stop it, to give it a chance to end gracefully; this also applies when the reason is a failed liveness probe. I guess nginx exits with code 0 while your katacoda http server exits with a non-zero code. The documentation of the Go ListenAndServe method states that it always returns a non-nil error: https://pkg.go.dev/net/http#Server.ListenAndServe
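
One way to confirm this, assuming one of the failed pods (e.g. pod-failed-27827190-8tqxk from the events above) still exists, is to read the terminated container's exit code straight from the pod status:

kubectl get pod pod-failed-27827190-8tqxk \
  -o jsonpath='{.status.containerStatuses[?(@.name=="docker-http-server")].state.terminated.exitCode}'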

You could override the container's default command with a shell script that starts the application in the background and then waits until the stop file is written:

containers:
  - name: docker-http-server
    image: katacoda/docker-http-server:latest
    command:
      - "sh"
      - "-c"
      - "/app & while true; do if [ -f /cache/stop ]; then exit 0; fi; sleep 1; done;"

Here, "/app" is the start command of the katacoda http server container.
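
If you are not sure what an image's actual start command is, you can read its entrypoint and cmd from the image metadata; for example:

docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' katacoda/docker-http-server:latest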

user2311578
  • You can use "docker inspect <image>" to find out the cmd and entrypoint of an image. – user2311578 Nov 28 '22 at 13:45
  • Not working for me `Pods Statuses: 1 Active (0 Ready) / 0 Succeeded / 4 Failed`, `Warning BackoffLimitExceeded 0s job-controller Job has reached the specified backoff limit` – Pp88 Nov 28 '22 at 14:18
  • Can you examine the state of the container with "kubectl describe pod <pod>"? What is its state, reason and exit code? Also the logs of the container would be of interest: "kubectl logs <pod> -c <container>" – user2311578 Nov 28 '22 at 14:57
  • This is what I see in the logs: `Defaulted container "docker-http-server" out of: docker-http-server, busy` – Pp88 Nov 28 '22 at 20:57