After a recent upgrade to GKE 1.26, I began encountering an issue with a Kubernetes job that has historically run without problems.
The job consists of two components:
- A simple initContainer that functions as a health check against an API/service that can sometimes take a while to respond when spinning up (~10 minutes at times)
- A script that handles the logic and a variety of calls to said API service
In a nutshell, it looks something like the following (some things omitted for brevity):
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
  labels:
    app: my-job
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: my-job
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
    spec:
      restartPolicy: Never
      ...
      initContainers:
        - name: wait-service
          ...
          command: ['bash', '-c', 'while [[ "$(curl -s -o /dev/null -w ''%{http_code}'' http://someService/api/v1/status)" != "200" ]]; do echo waiting for service; sleep 2s; done']
      containers:
        - name: run-job
          ...
      volumes:
        ...
      tolerations:
        ...
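For readability, the one-liner in the wait-service command expands to roughly the following (same curl/sleep loop, just spread across lines):

# Poll the service's status endpoint until it returns HTTP 200,
# sleeping 2 seconds between attempts (this can go on for ~10 minutes).
while [[ "$(curl -s -o /dev/null -w '%{http_code}' http://someService/api/v1/status)" != "200" ]]; do
  echo "waiting for service"
  sleep 2s
done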
The problem I’m encountering is that roughly 5 minutes after a deployment, while the initContainer is still running and waiting for the service, Kubernetes creates a new instance of the job (complete with its own initContainer, etc.). This is problematic primarily because two instances of the script running in the primary container (run-job) could easily get its operations out of sync or into a bad state (the script suspends and restores various services via the API in a specific order).
I can verify this in the logs of the original job:
│ wait-service waiting for service
│ failed container "run-job" in pod "my-job-20230721165715-rh6s2" is waiting to start: PodInitializing for .../my-job-20230721165715-rh6s2 (run-job)
│ wait-service waiting for service
So roughly 5 minutes after a new deployment of this job, I have two instances of it running (which lines up with the failed-container message above). This typically ends with one or both of them in a bad state.
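For reference, both instances show up when listing the job's pods by the app: my-job label from the manifest above:

# two my-job-<timestamp>-xxxxx pods appear at the same time
kubectl get pods -l app=my-job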
I’ve attempted a few configuration changes with little success, and I’m wondering what the best way to handle this would be. Essentially, I need to give the initContainer enough leeway that it doesn’t trigger the failure above and cause a new job instance to be created, but instead lets the original instance continue.