I'm currently facing an issue where one of my Kubernetes nodes keeps experiencing DiskPressure, leading to pod evictions and disruption of services. Despite our best efforts, we are struggling to identify the root cause of this problem. I'm seeking guidance on how to analyze and debug the issue effectively.
Here's the background information and steps we have taken so far:
- Kubernetes Version: 1.24.1
- Node Specifications:
  - OS: Ubuntu 20.04.4 LTS (amd64)
  - Kernel: 5.13.0-51-generic
  - Container runtime: containerd://1.6.6
- Pod and Resource Utilization:
Capacity:
cpu: 16
ephemeral-storage: 256Gi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65776132Ki
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 241591910Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65673732Ki
pods: 110
System Info:
Kernel Version: 5.13.0-51-generic
OS Image: Ubuntu 20.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.6
Kubelet Version: v1.24.1
Kube-Proxy Version: v1.24.1
Non-terminated Pods: (41 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
cert-manager cert-manager-7686fcb9bc-jptct 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
cert-manager cert-manager-cainjector-69d77789d-kmzb9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
cert-manager cert-manager-webhook-84c6f5779-gs8h7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops external-dns-7bdcbb7658-rvwqs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops filebeat-7l62m 100m (0%) 0 (0%) 100Mi (0%) 200Mi (0%) 20m
devops jenkins-597c5d498c-prs5x 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14m
devops kibana-6b577f877c-28ck4 100m (0%) 1 (6%) 0 (0%) 0 (0%) 46m
devops logstash-788d5f89b-pr79c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14m
devops nexus-6db65f8744-cxlhs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-authoritative-85dcd685c4-4mts8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-recursor-757854d6f8-5z25p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-recursor-nok8s-5db55c87f9-77ww6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops sonarqube-5767c467c9-2crz2 0 (0%) 0 (0%) 200Mi (0%) 0 (0%) 46m
devops sonarqube-postgres-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
ingress-nginx ingress-nginx-controller-75f6588c7b-gw77s 100m (0%) 0 (0%) 90Mi (0%) 0 (0%) 13m
jenkins-agents my-cluster-dev-tenant-develop-328-76mr4-ns67p-3xczd 0 (0%) 0 (0%) 350Mi (0%) 0 (0%) 72s
kube-system calico-kube-controllers-56cdb7c587-zmz4t 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
kube-system calico-node-pshn4 250m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system coredns-6d4b75cb6d-nrbmq 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 46m
kube-system coredns-6d4b75cb6d-q9hvs 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 46m
kube-system etcd-my-cluster 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 354d
kube-system kube-apiserver-my-cluster 250m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-controller-manager-my-cluster 200m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-proxy-qwmrd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-scheduler-my-cluster 100m (0%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system metrics-server-5744cd7dbb-h758l 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 34m
kube-system metrics-server-6bf466fbf5-nt5k6 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 47m
kube-system node-shell-0c3bde15-32fa-4831-9f05-ebfe5d14a909 0 (0%) 0 (0%) 0 (0%) 0 (0%) 43m
kube-system node-shell-692c6032-8301-44ac-b12e-e5a222a6f80a 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8m6s
lens-metrics prometheus-0 100m (0%) 0 (0%) 512Mi (0%) 0 (0%) 14m
imaginary-dev mailhog-7f666fdfbf-xgcwf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev ms-nginx-766bf76f87-ss8h6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev ms-tenant-f847987cc-rf9db 400m (2%) 500m (3%) 500M (0%) 700M (1%) 46m
imaginary-dev ms-webapp-5d6bcdcc4f-x68s4 100m (0%) 200m (1%) 200M (0%) 400M (0%) 46m
imaginary-dev mysql-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev redis-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat mailhog-685b7c6844-cpmfp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat ms-tenant-6965d68df8-nlm7p 500m (3%) 600m (3%) 512M (0%) 704M (1%) 46m
imaginary-uat ms-webapp-6cb7fb6c65-pfhsh 100m (0%) 200m (1%) 200M (0%) 400M (0%) 46m
imaginary-uat mysql-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat redis-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2800m (17%) 2500m (15%)
memory 3395905792 (5%) 2770231040 (4%)
ephemeral-storage 2Gi (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
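The output above is trimmed from `kubectl describe node`. To see which pods are actually being evicted from this node, we have been listing them roughly like this (the node name is a placeholder for our real node):

```bash
# Full node description, including conditions such as DiskPressure
kubectl describe node <node-name>

# Evicted pods end up in phase Failed; list them across all namespaces
kubectl get pods -A --field-selector=status.phase=Failed

# Recent eviction events, if they have not already aged out
kubectl get events -A --field-selector=reason=Evicted
```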
- Disk Usage Analysis: We looked at the disk usage on the node using the `du` and `df` commands, roughly as shown below.
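The containerd, kubelet, and pod-log paths below are the defaults and are our assumption about where the space is most likely going:

```bash
# Overall filesystem usage on the node
df -h

# Typical space consumers on a containerd-based node (default paths assumed)
sudo du -sh /var/lib/containerd /var/lib/kubelet /var/log/pods

# Largest directories directly under the containerd root
sudo du -h -d 1 /var/lib/containerd | sort -h | tail -n 10
```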
Despite the above efforts, we haven't been able to pinpoint the exact cause of the DiskPressure issue. We suspect it could be related to excessive logging, large container images, or inefficient resource allocation, but we are unsure how to confirm and resolve these suspicions.
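To start confirming the logging and image suspicions we tried something like the snippet below, but we are not confident we are interpreting the numbers correctly (crictl output and flags may vary by version; the log path is the kubelet default):

```bash
# Disk usage the container runtime reports for its image filesystem
sudo crictl imagefsinfo

# Image list with sizes, to spot unusually large images
sudo crictl images

# Ten largest container log files (kubelet writes them under /var/log/pods by default)
sudo find /var/log/pods -name '*.log' -printf '%s %p\n' | sort -n | tail -n 10
```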
Therefore, I kindly request assistance with the following:
- Best practices for analyzing and debugging DiskPressure issues in Kubernetes nodes.
- Tools or techniques to identify the specific processes or pods that are consuming the most disk space.
- Strategies to optimize resource allocation and disk usage within Kubernetes to mitigate DiskPressure problems.
- Any additional insights or recommendations for troubleshooting this issue effectively.
Any suggestions, recommendations, or experience-based insights would be greatly appreciated. Thank you in advance for your assistance!