
I am running a Kubernetes Job whose pods are terminating and being recreated multiple times for an unknown reason. I assume the pods are being terminated by some sort of eviction process, as the termination occurs across all pods and all jobs simultaneously. Tailing the container logs shows no indication that the container's command or process has failed. I am looking for a way to debug what is causing the termination of these pods.

The following is an example of the job manifest I am running:

{
 "apiVersion": "batch/v1",
 "kind": "Job",
 "metadata": {
  "generateName": "job-",
  "namespace": "default"
 },
 "spec": {
  "backoffLimit": 0,
  "template": {
   "spec": {
    "containers": [
     {
      "command": [
       "/bin/sh"
      ],
      "image": "******",
      "name": "x",
      "resources": {
       "limits": {
        "cpu": 2,
        "memory": "4G"
       },
       "requests": {
        "cpu": 2,
        "memory": "4G"
       }
      }
     }
    ],
    "restartPolicy": "Never"
   }
  },
  "ttlSecondsAfterFinished": 600
 }
}

I would like to use kubectl describe pod and kubectl logs to identify what caused the pods to be terminated. However, immediately upon termination, the pod is deleted and cannot be inspected using the above commands.
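For reference, one way to capture the pod state before it disappears would be to watch and dump the pods as they are created, along the lines of the sketch below (the job-name label is added by the job controller, the job name job-q4v5l is inferred from the pod names in the events below, and the output filenames are just illustrative):

# Record the full state of every pod the job creates, as it changes,
# so the information survives the pod being deleted.
kubectl get pods -n default -l job-name=job-q4v5l --watch -o yaml > job-q4v5l-pods.yaml

# In another terminal, stream a running pod's logs to a file as soon as it starts.
kubectl logs -n default -f job-q4v5l-vxtgg > job-q4v5l-vxtgg.log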

I have inspected kubectl get events to try to determine the reason for the pod being terminated. However, the output gives little information:

5m16s       Normal    Created                pod/job-q4v5l-vxtgg   Created container x
5m15s       Normal    Started                pod/job-q4v5l-vxtgg   Started container x
5m15s       Normal    Killing                pod/job-q4v5l-vxtgg   Stopping container x
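The full Event objects carry a little more detail than this summary (for example the reporting component and the complete message), so dumping them for one of the affected pods may help; a sketch, using the pod name from the events above:

# Dump the complete Event objects for a single pod, sorted by time,
# to see which component (kubelet, job controller, ...) issued the Killing event.
kubectl get events -n default \
  --field-selector involvedObject.name=job-q4v5l-vxtgg \
  --sort-by=.lastTimestamp -o yaml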

The kubectl describe job command shows the following events. As can be seen from this output, a pod is repeatedly created.

Events:
  Type    Reason            Age                     From            Message
  ----    ------            ----                    ----            -------
  Normal  SuccessfulCreate  6m38s                   job-controller  Created pod: job-q4v5l-7trcd
  Normal  SuccessfulCreate  6m34s                   job-controller  Created pod: job-q4v5l-zzw27
  Normal  SuccessfulCreate  6m33s                   job-controller  Created pod: job-q4v5l-4crzq
  Normal  SuccessfulCreate  6m31s                   job-controller  Created pod: job-q4v5l-sjbdh
  Normal  SuccessfulCreate  6m28s                   job-controller  Created pod: job-q4v5l-fhz2x
  Normal  SuccessfulCreate  6m25s                   job-controller  Created pod: job-q4v5l-6vgg5
  Normal  SuccessfulCreate  6m22s                   job-controller  Created pod: job-q4v5l-7dmh4
  Normal  SuccessfulCreate  6m19s                   job-controller  Created pod: job-q4v5l-klf4q
  Normal  SuccessfulCreate  6m15s                   job-controller  Created pod: job-q4v5l-87vwx
  Normal  SuccessfulCreate  5m32s (x16 over 6m12s)  job-controller  (combined from similar events): Created pod: job-q4v5l-6x5pv
  • Try running the job as a pod, and see if it crashes in the same way; then you can interact directly with the pod to get logs, etc. – Blender Fox Mar 14 '23 at 20:09
  • There's a [section in the documentation](https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures) on that. – ollaw Mar 14 '23 at 23:46
  • @BlenderFox, thanks for your suggestion. Unfortunately, I have the same issue with just pods. Following termination `kubectl describe pod podname` and `kubectl logs podname` returns that the pod is not found. – Eddie Aspden Apr 21 '23 at 12:35

2 Answers


As explained in a blog post by Shahar Azulay:

There are many reasons why Pods could end up in the Failed state due to unsuccessful container termination. Common root causes include failure to pull the container image because it’s unavailable, bugs in application code or misconfigurations in the Pod’s YAML. But simply knowing that a Pod has failed doesn’t mean you’ll know the cause of failure. Unless you dig deeper, the only thing that you’ll know is that it is in the Failed state.

One way to dig deeper is to look at container exit codes. Container exit codes are numeric codes that give a nominal reason for why a container stopped working. You can get the exit code for the containers in a Pod by running:

kubectl get pod termination-demo

Refer to this doc for more information about the reasons for pod failure, and this doc for debugging pods.
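As a concrete sketch of the above (termination-demo is the example Pod name from the Kubernetes docs; for a container that ran once and terminated, the exit code sits under state.terminated in the Pod's status):

# Print the exit code and reason recorded in the pod status.
kubectl get pod termination-demo \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}{" "}{.status.containerStatuses[0].state.terminated.reason}{"\n"}'

# Or dump the whole object and read the containerStatuses section directly.
kubectl get pod termination-demo -o yaml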

Hemanth Kumar
Sai Chandini Routhu
  • Thanks for your answer. However, `kubectl get pod` has the same issue as `kubectl describe pod`. Following the pod terminating, these commands all return `pods not found`. – Eddie Aspden Apr 21 '23 at 12:37

I tweaked your yaml, substituting in busybox, to simulate what you're doing:

{
 "apiVersion": "batch/v1",
 "kind": "Job",
 "metadata": {
  "generateName": "job-",
  "namespace": "default"
 },
 "spec": {
  "backoffLimit": 0,
  "template": {
   "spec": {
    "containers": [
     {
      "command": [
       "/bin/sh"
      ],
      "image": "busybox",
      "name": "x",
      "resources": {
       "limits": {
        "cpu": 2,
        "memory": "4G"
       },
       "requests": {
        "cpu": 2,
        "memory": "4G"
       }
      }
     }
    ],
    "restartPolicy": "Never"
   }
  },
  "ttlSecondsAfterFinished": 600
 }
}

This created one pod, which exited successfully:

$ kubectl get pods -n default
NAME              READY   STATUS      RESTARTS   AGE
job-vn8mc-jnpzz   0/1     Completed   0          3m34s

I did not get any pods disappearing like you indicated.

My kubectl describe job:

Events:
  Type    Reason            Age    From            Message
  ----    ------            ----   ----            -------
  Normal  SuccessfulCreate  4m49s  job-controller  Created pod: job-vn8mc-jnpzz
  Normal  Completed         3m8s   job-controller  Job completed

My kubectl get events:

4m10s       Normal    Created                        pod/job-vn8mc-jnpzz                                        Created container x
4m10s       Normal    Started                        pod/job-vn8mc-jnpzz                                        Started container x
5m47s       Normal    SuccessfulCreate               job/job-vn8mc                                              Created pod: job-vn8mc-jnpzz
4m6s        Normal    Completed                      job/job-vn8mc                                              Job completed

Compare this with yours:

5m16s       Normal    Created                pod/job-q4v5l-vxtgg   Created container x
5m15s       Normal    Started                pod/job-q4v5l-vxtgg   Started container x
5m15s       Normal    Killing                pod/job-q4v5l-vxtgg   Stopping container x

What this tells me is that your job is creating the pod, the pod is failing to complete successfully, and the job is retrying and then giving up.

So, I've converted your job into a single pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: job-as-pod
  namespace: default
spec:
  containers:
  - command:
    - /bin/sh
    image: *******
    imagePullPolicy: Always
    name: x
  restartPolicy: Never

Run this and it should create a pod named job-as-pod that will either complete:

$ kubectl get pods
NAME         READY   STATUS      RESTARTS   AGE
job-as-pod   0/1     Completed   0          2m15s

or fail:

$ kubectl get pods
NAME         READY   STATUS   RESTARTS   AGE
job-as-pod   0/1     Error    0          12s

I expect that if you plug in your image here, it'll error. Then you can debug the exact error.
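Because nothing owns or garbage-collects this standalone pod, it remains after termination, so the usual inspection commands have something to look at, for example:

# The pod object stays around after the container terminates.
kubectl describe pod job-as-pod      # events, container state, exit code and reason
kubectl logs job-as-pod              # the container's output

# Or pull the raw termination details straight out of the status.
kubectl get pod job-as-pod -o jsonpath='{.status.containerStatuses[0].state.terminated}'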

Blender Fox
  • In my cluster, the pod also disappears with this manifest at certain times. I believe it’s caused by some form of pod eviction. The reason I want to be able to debug this is to work out what is causing the pod to be evicted. – Eddie Aspden Apr 21 '23 at 20:21
  • I noticed you haven't specified how you're running Kubernetes. Are you running a managed service (Amazon, Google, Azure) or a local cluster (Minikube, kind, k3s)? – Blender Fox Apr 21 '23 at 20:37
  • Also, it would be very helpful if you could share the image you're using (if it isn't private or confidential); that way we can do a like-for-like comparison. – Blender Fox Apr 21 '23 at 20:39
  • It’s a local cluster with kubeadm. The main image is private, but it’s happening across all images, containers and pods simultaneously. For example, when our main job containers are terminated, our monitoring agent containers (e.g. DataDog) are also terminated – Eddie Aspden Apr 21 '23 at 20:55
  • Is it running on physical machines or VMs? Do you have direct access to the nodes? If so, the next step is to see which node one of your pods lands on, then go onto the node itself and check its Kubernetes logs (such as the kubelet's) to see if there's anything there. Also check the master/control-plane node logs, as they might detail why it's evicting (if eviction is the cause). – Blender Fox Apr 22 '23 at 05:30
  • There's a mix of bare metal & VMs. The kubelet logs on the nodes have the following log, which seems to be relevant: `Eviction manager: failed to get summary stats`. In the kube-scheduler logs on the control plane, the following line seems to match up with the time a pod is terminated: `"Failed running Bind plugin" err="Operation cannot be fulfilled on pods/binding \"podname\": pod podname is being deleted, cannot be assigned to a host" plugin="DefaultBinder" pod="default/podname"` – Eddie Aspden Apr 24 '23 at 10:16
  • Can you try adding a node selector (https://kubernetes.io/docs/tasks/configure-pod-container/assign-pods-nodes/) to force it onto a bare metal machine and see if the problem still exists? Then try the same but force it onto the VMs. – Blender Fox Apr 24 '23 at 10:23
  • It occurs at random for several hours at a time. Sometimes (like now) it is fine and I cannot replicate the problem, but on Friday it was consistent for the majority of the day. However, when it does occur, it happens on both bare metal and VMs – Eddie Aspden Apr 24 '23 at 10:34
  • If you have any log shipping on the cluster, e.g. via Stackdriver or ELK, it would be worth checking that also. Finally, this issue (https://github.com/kubernetes/kubernetes/issues/101877) has the same starting message (although it is a slightly different issue). I am unable to reproduce your issue from your yaml, but I did find this question: https://stackoverflow.com/questions/51638559/how-to-diagnose-kubernetes-not-responding-on-api which also has the `Eviction manager` error message. Maybe that will be of help? – Blender Fox Apr 24 '23 at 10:50