Below is my definition for a k8s Job (to convert a column of a MySQL table from int to bigint using Percona's pt-online-schema-change):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bigint-tablename-columnname
  namespace: prod
spec:
  backoffLimit: 0
  template:
    metadata:
      name: convert-int-to-bigint-
    spec:
      containers:
      - name: percona
        image: perconalab/percona-toolkit:3.2.1
        command: [
          "/bin/bash",
          "-c",
          "pt-online-schema-change --host=dbhost --user=dbuser --password=dbpassword D=dbname,t=tablename --alter \"MODIFY COLUMN columnname BIGINT\" --alter-foreign-keys-method \"rebuild_constraints\" --nocheck-foreign-keys --execute"
        ]
        env:
          - name: SYMFONY__POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
          restartPolicy: Never
```

The pod failed for some reason: `kubectl describe job jobname` shows `Pods Statuses: 0 Active (0 Ready) / 0 Succeeded / 1 Failed`. However, in `kubectl get pods` there is no pod associated with the job, so I cannot view the pod logs to find out why it failed.

I thought using `restartPolicy: Never` should keep the pod around as per 1, 2, but clearly my understanding isn't correct. So how do I ensure that, if this process fails, the pod is kept around for me to inspect?
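
For reference, this is roughly how I've been looking for the pod (a sketch; it assumes the standard `job-name` label that the Job controller puts on its pods, using the names from my manifest above):

```bash
# Pods created by a Job carry the job-name label set by the Job controller
kubectl get pods -n prod -l job-name=bigint-tablename-columnname

# Recent events can sometimes explain a disappeared pod (eviction, garbage collection, ...)
kubectl get events -n prod --sort-by=.lastTimestamp
```

The first command comes back empty, which is what prompted this question.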

callum
  • Does `kubectl describe job` say anything? There are circumstances where it turns out to be impossible to create the Pod and that might result in this state. – David Maze Aug 23 '23 at 11:07
  • Thanks for the reply, it wasn't any creation/startup problem as the job was a long-running one that had been going for over a week at the time it failed. Anything in particular I should be looking for in `kubectl describe job`? There's nothing that stands out as relevant besides `Pods Statuses: 0 Active (0 Ready) / 0 Succeeded / 1 Failed` – callum Aug 23 '23 at 11:21
  • Does [completions and parallelism](https://levelup.gitconnected.com/understanding-jobs-in-kubernetes-68ac21b272d8#:~:text=completions%20%26%20parallelism) help to resolve your issue? Also, please add the whole output of `kubectl describe job` by editing the question. – Sai Chandra Gadde Aug 23 '23 at 11:33
  • In this case no, because in fact this job I very specifically don't want to run more than once - if it fails I need to do some manual cleanup before retrying, so it's important a job only ever creates one pod. I've cleaned up the job now though I'm afraid, what further info could it have contained? – callum Aug 23 '23 at 14:40

1 Answer

Once the pod object is gone, you won't be able to get the logs: `kubectl logs` only fetches logs for resources that still exist in the cluster.

One way around this is to continuously ship your logs somewhere else while the pod is alive. The Kubernetes documentation describes several strategies for this; using a cluster-level logging backend is one of them.

https://kubernetes.io/docs/concepts/cluster-administration/logging/
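
As a minimal sketch of that idea (the PersistentVolumeClaim name `job-logs`, the mount path `/logs`, and the log file name are all hypothetical, not from the question), you could tee the tool's output onto a volume that outlives the pod:

```yaml
# Sketch: persist the job's output on a PVC so it survives pod deletion.
# Assumes a PersistentVolumeClaim named "job-logs" already exists (hypothetical).
spec:
  containers:
  - name: percona
    image: perconalab/percona-toolkit:3.2.1
    command:
      - /bin/bash
      - -c
      # tee writes a copy of stdout/stderr to the mounted volume
      - pt-online-schema-change ... --execute 2>&1 | tee /logs/pt-osc.log
    volumeMounts:
      - name: logs
        mountPath: /logs
  volumes:
    - name: logs
      persistentVolumeClaim:
        claimName: job-logs
```

Even after the pod object is deleted, the log file can then be read from any other pod that mounts the same claim.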

aboitier
  • I don't think this can be quite right - I have an unrelated job whose pod also failed, but if I do `kubectl get pods` that pod is still sitting there with a status of `Error` and I can view its logs, so that one did not get terminated by k8s. So a failed/errored pod is not always removed by k8s, I just don't know under what conditions it is – callum Aug 23 '23 at 14:38
  • Maybe you should indent "restartPolicy" within spec, and not within env. – aboitier Aug 24 '23 at 11:22
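
The indentation the last comment suggests would look like this (a fragment of the pod spec only, other fields as in the question): `restartPolicy` is a field of the pod spec itself, a sibling of `containers`, not an entry under `env`:

```yaml
    spec:
      containers:
      - name: percona
        image: perconalab/percona-toolkit:3.2.1
        # ...command and env as in the question...
      restartPolicy: Never   # sibling of "containers", not nested inside "env"
```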