
I'm using the K3s distribution of Kubernetes, deployed on a Spot EC2 instance in AWS.

I have scheduled a processing job, and sometimes this job gets terminated and ends up in an "Unknown" state (the job's code is abnormally terminated).

Running kubectl describe pod <pod_name> shows this:

    State:          Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Wed, 06 Jan 2021 21:13:29 +0000
      Finished:     Wed, 06 Jan 2021 23:33:46 +0000

The AWS logs show that CPU consumption was at 99% right before the crash. From a number of sources (1, 2, 3) I saw that this can be a reason for a node crash, but I didn't see that here. What may be the reason?

Thanks!

sborpo

1 Answer


The actual state of the Job is Terminated with the reason Unknown. In order to debug this situation you need to get the relevant logs from the Pods created by your Job.

When a Job completes, no more Pods are created, but the Pods are not deleted either. Keeping them around allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic output.

To do so, execute kubectl describe job $JOB to see the Pods' names under the Events section, and then execute kubectl logs $POD.
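
For example, assuming the Job is named processing-job and it created a Pod named processing-job-abc12 (both placeholder names), a minimal sequence could look like this:

    kubectl describe job processing-job          # Pod names are listed under the Events section
    kubectl logs processing-job-abc12            # logs of the Pod created by the Job
    kubectl logs processing-job-abc12 --previous # logs of the previous container instance, if it was restarted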

If that is not enough, you can try different ways to Debug Pods, such as:

  • Debugging with container exec

  • Debugging with an ephemeral debug container, or

  • Debugging via a shell on the node
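
For reference, the commands behind these approaches look roughly like this (Pod, container, and node names are placeholders; kubectl debug with ephemeral containers needs a sufficiently recent Kubernetes/K3s version):

    # Debugging with container exec (only works while the container is still running)
    kubectl exec -it my-pod -- /bin/sh

    # Debugging with an ephemeral debug container attached to the existing Pod
    kubectl debug -it my-pod --image=busybox:1.28 --target=my-container

    # Debugging via a shell on the node (starts a privileged Pod on that node)
    kubectl debug node/my-node -it --image=ubuntu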

The methods above will give you more info regarding the actual reasons behind the Job termination.

Wytrzymały Wiktor
  • I have called the logs command on the pod (I still have the pod), but no relevant log is produced, just the regular prints from the application, and they were cut off, like someone just killed the pod itself – sborpo Jan 07 '21 at 14:29
  • 1
    How was the node behaving in the moment of Job failure? Are there enough resources for it to complete it's work? Are you the only person in charge of your cluster? Maybe there are other users that could simply mess up with the Job or it's Pods? – Wytrzymały Wiktor Jan 08 '21 at 10:12
  • It's a single node running within a VM and there are no other users. What is the best way to see what happened to the node at that time? Which Kubernetes component's logs need to be checked to see what the problem is? Thanks! – sborpo Jan 08 '21 at 20:46
  • This is well explained in [these docs](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/#looking-at-logs). You will find the locations of the relevant log files there. – Wytrzymały Wiktor Jan 11 '21 at 09:49
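
As a rough sketch of where to look on a single-node K3s setup (assuming K3s runs as a systemd service and the node is Debian/Ubuntu based; adjust paths and time ranges accordingly):

    # K3s bundles the kubelet and container runtime into one systemd unit
    journalctl -u k3s --since "2021-01-06 23:00" --until "2021-01-07 00:00"

    # System log, useful for spotting OOM kills or a spot-instance shutdown
    grep -iE "out of memory|oom|shutting down" /var/log/syslog

    # Recent cluster events recorded by Kubernetes
    kubectl get events --sort-by=.metadata.creationTimestamp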