After a recent upgrade to GKE 1.26, I began encountering a problem with a Kubernetes job that has historically run without issue.

The job itself consists of two components:

  • A simple initContainer that functions as a health check against an API/service that can sometimes take a while to respond when spinning up (~10 minutes at times)
  • A script that handles logic and a variety of calls to said API service

It looks something like the following in a nutshell (some things omitted for brevity):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
  labels:
    app: my-job
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: my-job
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
    spec:
      restartPolicy: Never
      ...
      initContainers:
      - name: wait-service
        ...
        command: ['bash', '-c', 'while [[ "$(curl -s -o /dev/null -w ''%{http_code}'' http://someService/api/v1/status)" != "200" ]]; do echo waiting for service; sleep 2s; done']
      containers:
        - name: run-job
          ...
      volumes:
          ...
      tolerations: 
          ...

The problem I’m encountering is that roughly five minutes after a deployment, while the initContainer is still running and waiting on the service, Kubernetes creates a second instance of the job (complete with its own initContainer, etc.). This is problematic primarily because two instances of the script running in the primary container (run-job) could easily push its operations out of sync or into a bad state (the script suspends and restores various services via the API in a specific order).

I can verify this within the logs of the original job:

| wait-service waiting for service
| failed container "run-job" in pod "my-job-20230721165715-rh6s2" is waiting to start: PodInitializing for .../my-job-20230721165715-rh6s2 (run-job)
| wait-service waiting for service

So roughly five minutes after a new deployment of this job, I have two instances of it running (which aligns with the failed-container message above). This typically ends with one or both of them in a bad state.

I’ve attempted a few configuration changes with little success, and I’m wondering what the best way to handle this would be. Essentially, I need the job to tolerate a long-running initContainer so that it doesn’t trigger the failure above and recreate a new job, but instead continues with the original instance.

Rion Williams
  • See if there's anything in the events for the job `kubectl describe job -n {namespace} {jobname}` (the events auto purge though, so check this as soon as possible after the second job comes up). Also, does the job get spawned from a cronjob? – Blender Fox Jul 22 '23 at 05:39
  • The only thing that I’ve seen is the failed container “run-job” message in the logs that is triggered at the 5 minute mark (which results in the creation of the second job). I’ve verified that the service for the initContainer still isn’t ready at that point as well, but just need to tell the job to wait longer before trying again with a new pod. The job itself is part of a deployment that goes out on a regular basis and not spawned from a cronjob. – Rion Williams Jul 22 '23 at 12:27
  • Are you deploying using Helm? – Blender Fox Jul 22 '23 at 19:16
  • Yes in this case. – Rion Williams Jul 22 '23 at 19:52

1 Answer

Since you're using Helm and you've given the job a timestamped name (my-job-{{ now | date "20060102150405" }}), each helm install/upgrade creates a fresh job, and that new job has no connection to any existing job(s) that may still be running at the time you upgrade.

If you want to ensure existing jobs are terminated when you deploy, consider using a pre-upgrade hook to delete any existing jobs in the application namespace before the upgrade is applied.
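A rough sketch of what such a hook could look like (the hook Job's name, the kubectl image, the service account, and the app: my-job label selector are assumptions and would need to match your chart and RBAC setup):

apiVersion: batch/v1
kind: Job
metadata:
  name: cleanup-old-jobs
  annotations:
    # Run before the release is upgraded, and remove the hook Job once it succeeds
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: job-cleanup   # assumed service account allowed to delete jobs
      containers:
      - name: cleanup
        image: bitnami/kubectl:latest
        # Delete any leftover jobs from previous releases before the new one goes out
        command: ['kubectl', 'delete', 'job', '-l', 'app=my-job', '--ignore-not-found']

Because the hook-delete-policy is hook-succeeded, the cleanup Job itself is removed once it finishes, so it doesn't pile up across releases.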


UPDATE 1

I've spun up a 1.26 cluster and used your example (with a few tweaks to get it to run), left it for 10 minutes, and got no additional jobs or pods.

What you can do in the meantime, however, is trace the pods backwards to find out what "owns" them. If you kubectl describe {pod}, you'll see a line reading "Controlled By" within the output. For example:

Controlled By:  Job/example-service-deploy-jobs-20230722170514

If you see two pods, describe both and check whether they reference the same job. If both point at the same job, then that job has spawned two pods -- this normally means it considered the first pod failed and spawned a second one to try again.
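To see the pods that belong to one particular job, you can also use the job-name label the Job controller adds to every pod it creates (the job name below is just the one from the example output above):

# List the pods spawned by this specific job
kubectl get pods -l job-name=example-service-deploy-jobs-20230722170514 -o wide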

If you see a different job referenced, it means another job has been deployed without deleting the first one.

Describe the jobs and see whether they also have a "Controlled By" field (they shouldn't if they were installed by Helm or deployed manually using kubectl apply or similar) -- my reason for this check is to see whether something (like a cronjob) is triggering the jobs.
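One way to check that across all jobs at once is to print each job's owner (if any) using custom columns; this is a sketch, so substitute your own namespace:

# An empty OWNER/KIND column means nothing (e.g. no CronJob) created the job
kubectl get jobs -n my-namespace \
  -o custom-columns=NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].name,KIND:.metadata.ownerReferences[0].kind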

Separate question: how is your cluster hosted -- bare metal or a managed service (AKS, EKS, GKE, etc.)?

Another possibility, if you're on a managed service, is that you're running on Spot/Preemptible instances, or the node is having some other issue. You can watch the nodes (watch kubectl get nodes) to see whether any of them terminate while you're watching the init container -- if they do, you can start investigating the reason for the node termination.
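If the cluster turns out to be on GKE, one quick check for preemptible/Spot nodes is to surface the GKE node-pool labels as extra columns (these label names are the standard GKE ones and are an assumption about your node pools; on other platforms the columns will simply be empty):

# GKE sets these labels on nodes that belong to preemptible/Spot node pools
kubectl get nodes -L cloud.google.com/gke-preemptible -L cloud.google.com/gke-spot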

In short, it is not the job itself that is the issue, but something else around it (or in the cluster).

Blender Fox
  • Thanks, I guess what I’m wondering is why the single deployment is resulting in multiple instances of the job? When it is initially deployed, I can see the first one and the second instance starts up at the ~5 minute mark (aligning with the container failed message). I’m trying to avoid the second instance spinning up at all (i.e. tolerating a longer running initContainer). – Rion Williams Jul 22 '23 at 20:57
  • If you are able to, can you share the entire chart (link to a git repo or similar, and redact anything you need to) – Blender Fox Jul 22 '23 at 21:05
  • I’ll try to get a gist together when I get a chance of the deployment, etc. – Rion Williams Jul 22 '23 at 22:38
  • [This is an example](https://gist.github.com/rionmonster/cb71b8f34ed8bc1b4f048a3ec6b07a11) of what the YAML for the deployed Helm looks like. I'm not sure just how helpful it is as it's heavily redacted but the key one in question would be the `example-service-deploy-jobs` which is the one that has the long running initContainer (`wait-service`) that causes the inner container (`run-job`) to fail after 5 minutes and spawns a new job. – Rion Williams Jul 23 '23 at 00:57
  • Can you share the helm chart itself, not just the rendered yaml? – Blender Fox Jul 23 '23 at 07:17
  • Also I noticed you have `restartPolicy: OnFailure` -- this does cause problems when debugging (see https://stackoverflow.com/questions/75974501/kubernetes-finding-logs-of-a-failed-cronjob/75975700#75975700) -- namely, it causes failed job pods to be hidden. Try setting it to `Never` -- you'll get some failed pods instead of them restarting, but then you can debug the exact reason for those failures. – Blender Fox Jul 23 '23 at 07:52
  • Finally, I used your sample job yaml in my own kind cluster, tweaked a little to point at google.com and with the status-code check inverted (to simulate a service not being ready yet) -- I ran it for 10 minutes both with `restartPolicy` set to `Never` and to `OnFailure`, and never saw a second job pod appear – Blender Fox Jul 23 '23 at 07:58
  • I’ll try to get an example of the chart itself. It’s also worth mentioning that this didn’t occur until a recent upgrade to GKE (I think 1.26), so I’m not sure if that’s related or not. – Rion Williams Jul 23 '23 at 12:34
  • It's unlikely the upgrade would have caused this, as I can't see anything within the docs that relates to the `Job`, but I'll see if I can spin up a 1.26 cluster to test your example – Blender Fox Jul 23 '23 at 12:44
  • I've [added an updated version of the templated chart to the gist](https://gist.github.com/rionmonster/cb71b8f34ed8bc1b4f048a3ec6b07a11) for just the Job since that seems to be only relevant/problematic part. I can grab some additional sections if there's specifics as well. – Rion Williams Jul 23 '23 at 15:49
  • Added some updates to the answer. Please take a look and let us know how you get on – Blender Fox Jul 23 '23 at 17:12
  • Okay, that is one thing that I noticed when the initial issue arose. The `controller-uid` for both jobs is different; however, the second job only spins up after the failed-container message above appears, ~5 minutes in while the initial job is still initializing (via initContainer). I’ve verified that two instances are always spun up (~5 minutes apart) per deployment, and all I can think of is whether there’s an easy way for the job itself to not fail on the long-running initContainer. I feel like if that can stop, it might stop the secondary job from showing up. This is all running in GKE. – Rion Williams Jul 23 '23 at 17:53
  • Is the "controlled by" value empty for both jobs? Because that would imply something is deploying another job. Perhaps a CI/CD pipeline? – Blender Fox Jul 23 '23 at 18:00
  • Nope, if I remember correctly it was a GUID/UUID (different for each). What’s curious is, I was under the assumption that the primary container wouldn’t even be attempted to run until after the initContainer had completed/exited successfully. So to see the “run-job” one fail while the initContainer is running (per the logs) is a bit unexpected. – Rion Williams Jul 23 '23 at 18:21
  • I managed to find the specific ones with the same name (sans timestamps) and differing selectors / controller values: `Name: example-deploy-jobs-20230721145419, ... , Selector: controller-uid=60756759-9360-4092-beb6-423408414168` and `Name: example-deploy-jobs-20230721145955, ..., Selector: controller-uid=da8c2f9c-974e-4682-ab22-d13198a836fe` – Rion Williams Jul 23 '23 at 21:31
  • I also managed to go take a look at the pods associated with the jobs, I'm wondering if this is the culprit? `Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s,node.kubernetes.io/unreachable:NoExecute op=Exists for 300s` They seem to be defaults in this specific environment that aren't being overridden. Not sure why it would cause the underlying container to fail (even though the initContainer is running), but maybe these need to be increased for this particular job? – Rion Williams Jul 23 '23 at 21:38
  • The tolerations just allow the pod to be scheduled on nodes that otherwise would not take the pods. Regarding your comment: the init container and the main container running at the same time shouldn't be possible (unless you've hit a bug). If you `kubectl get pods` then you'll either see the status reading `Init: 0/1` if it's running the init container or `Running 1/1` if it's running the main container (or some failure status if the pod has failed outright) – Blender Fox Jul 24 '23 at 05:56
  • That makes sense on the tolerations, thanks for that. I know the message that I’m getting is saying the main container is failing _while_ the curl/init loop of the initContainer is running, which seems odd since I don’t think that container could/would fail if it wasn’t running. – Rion Williams Jul 24 '23 at 11:52
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/254651/discussion-between-blender-fox-and-rion-williams). – Blender Fox Jul 25 '23 at 07:46