I've done "load" test on my environment to test if requests/limits that are left assigned on the completed job (pod) will indeed have influence on the ResourceQuota that I've set.
This is what my ResourceQuota looks like:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-quota
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 2Gi
    limits.cpu: "2"
    limits.memory: 3Gi
This is the cpu/memory request/limit set on each k8s job (to be precise, on the container running in the Pod spun up by the Job):
resources:
  limits:
    cpu: 250m
    memory: 250Mi
  requests:
    cpu: 100m
    memory: 100Mi
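For context, here is a minimal sketch of how such a Job could be defined. Only the container name "myjob" and the resources block come from the setup above; the Job name, namespace, image and command are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: load-test-job        # placeholder name
  namespace: my-namespace    # placeholder namespace
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: myjob
        image: busybox                     # placeholder image
        command: ["sh", "-c", "sleep 30"]  # placeholder workload
        resources:
          limits:
            cpu: 250m
            memory: 250Mi
          requests:
            cpu: 100m
            memory: 100Mi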
Results of testing:
- Currently running number of jobs: 66
- Expected sum of CPU requests (if the assumption from the question is correct): 66 × 100m = 6,600m (6.6 CPU)
- Expected sum of memory requests (if the assumption from the question is correct): 66 × 100Mi = 6,600Mi (~6.4Gi)
- Expected sum of CPU limits (if the assumption from the question is correct): 66 × 250m = 16,500m (16.5 CPU)
- Expected sum of memory limits (if the assumption from the question is correct): 66 × 250Mi = 16,500Mi (~16.1Gi)

All of these exceed the quota's hard limits, so if completed jobs still counted, new jobs should be rejected (a kubectl cross-check of live quota usage is shown right after this list).
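Independently of Prometheus, the live quota consumption can be cross-checked with kubectl; the Used column/field is what the quota admission controller actually charges (this assumes the quota object is named mem-cpu-quota, as above, and lives in the namespace under test):

kubectl describe resourcequota mem-cpu-quota -n <namespace>
kubectl get resourcequota mem-cpu-quota -n <namespace> -o yaml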
I've created Grafana graphs that show the following:
CPU usage/requests/limits for jobs in one namespace
sum(rate(container_cpu_usage_seconds_total{namespace="${namespace}", container="myjob"}[5m]))
sum(kube_pod_container_resource_requests_cpu_cores{namespace="${namespace}", container="myjob"})
sum(kube_pod_container_resource_limits_cpu_cores{namespace="${namespace}", container="myjob"})
Memory usage/requests/limits for jobs in one namespace
sum(rate(container_memory_usage_bytes{namespace="${namespace}", container="myjob"}[5m]))
sum(kube_pod_container_resource_requests_memory_bytes{namespace="${namespace}", container="myjob"})
sum(kube_pod_container_resource_limits_memory_bytes{namespace="${namespace}", container="myjob"})
This is what the graphs look like:

According to these graphs, requests/limits accumulate and go well beyond the ResourceQuota thresholds. However, I'm still able to run new jobs without a problem.
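Side note: the request/limit sums above include every pod that kube-state-metrics still reports, Completed ones included, which is why they keep accumulating. Below is a sketch of how the same CPU-requests panel could be restricted to running pods only, assuming kube_pod_status_phase is available from kube-state-metrics in this cluster (on kube-state-metrics v2 the resource metrics are renamed, e.g. kube_pod_container_resource_requests{resource="cpu"}):

sum(
  kube_pod_container_resource_requests_cpu_cores{namespace="${namespace}", container="myjob"}
  * on(namespace, pod) group_left()
  (kube_pod_status_phase{namespace="${namespace}", phase="Running"} == 1)
)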
At this point I started doubting what those metrics were actually showing and opted to check a different part of the metrics. Specifically, I used the following set:
CPU:
sum (rate(container_cpu_usage_seconds_total{namespace="$namespace"}[1m]))
kube_resourcequota{namespace="$namespace", resource="limits.cpu", type="hard"}
kube_resourcequota{namespace="$namespace", resource="requests.cpu", type="hard"}
kube_resourcequota{namespace="$namespace", resource="limits.cpu", type="used"}
kube_resourcequota{namespace="$namespace", resource="requests.cpu", type="used"}
Memory:
sum (container_memory_usage_bytes{image!="",name=~"^k8s_.*", namespace="$namespace"})
kube_resourcequota{namespace="$namespace", resource="limits.memory", type="hard"}
kube_resourcequota{namespace="$namespace", resource="requests.memory", type="hard"}
kube_resourcequota{namespace="$namespace", resource="limits.memory", type="used"}
kube_resourcequota{namespace="$namespace", resource="requests.memory", type="used"}
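If needed, the remaining headroom per resource can also be plotted directly from these series; a sketch for CPU requests (the hard and used series pair up on the resourcequota and resource labels):

kube_resourcequota{namespace="$namespace", resource="requests.cpu", type="hard"}
  - ignoring(type)
kube_resourcequota{namespace="$namespace", resource="requests.cpu", type="used"}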
This is what the graph looks like:

Conclusion:
From this screenshot it is clear that, once the load test completes and the jobs move into the Completed state, their cpu/memory requests/limits are released and no longer count against the ResourceQuota thresholds, even though the pods are still around (with READY: 0/1 and STATUS: Completed).
This can be seen by observing the following data on the graph:
CPU allocated requests
CPU allocated limits
Memory allocated requests
Memory allocated limits
all of which increase when load hits the system and new jobs are created, but drop back to their previous values as soon as the jobs complete (even though the completed pods are not deleted from the environment).
In other words, cpu and memory requests/limits are counted against the ResourceQuota only while the job (and its corresponding pod) is in the Running state.