I'm currently facing an issue where one of my Kubernetes nodes keeps experiencing DiskPressure, leading to pod evictions and disruption of services. Despite our best efforts, we are struggling to identify the root cause of this problem. I'm seeking guidance on how to analyze and debug the issue effectively.
Here's the background information and steps we have taken so far:
- Kubernetes Version: 1.24.1
- Node Specifications:
  - OS: Ubuntu 20.04.4 LTS (amd64)
  - Kernel: 5.13.0-51-generic
  - Container runtime: containerd://1.6.6
- Pod and Resource Utilization:
Capacity:
cpu: 16
ephemeral-storage: 256Gi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65776132Ki
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 241591910Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65673732Ki
pods: 110
System Info:
Kernel Version: 5.13.0-51-generic
OS Image: Ubuntu 20.04.4 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.6.6
Kubelet Version: v1.24.1
Kube-Proxy Version: v1.24.1
Non-terminated Pods: (41 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
cert-manager cert-manager-7686fcb9bc-jptct 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
cert-manager cert-manager-cainjector-69d77789d-kmzb9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
cert-manager cert-manager-webhook-84c6f5779-gs8h7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops external-dns-7bdcbb7658-rvwqs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops filebeat-7l62m 100m (0%) 0 (0%) 100Mi (0%) 200Mi (0%) 20m
devops jenkins-597c5d498c-prs5x 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14m
devops kibana-6b577f877c-28ck4 100m (0%) 1 (6%) 0 (0%) 0 (0%) 46m
devops logstash-788d5f89b-pr79c 0 (0%) 0 (0%) 0 (0%) 0 (0%) 14m
devops nexus-6db65f8744-cxlhs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-authoritative-85dcd685c4-4mts8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-recursor-757854d6f8-5z25p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops powerdns-recursor-nok8s-5db55c87f9-77ww6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
devops sonarqube-5767c467c9-2crz2 0 (0%) 0 (0%) 200Mi (0%) 0 (0%) 46m
devops sonarqube-postgres-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
ingress-nginx ingress-nginx-controller-75f6588c7b-gw77s 100m (0%) 0 (0%) 90Mi (0%) 0 (0%) 13m
jenkins-agents my-cluster-dev-tenant-develop-328-76mr4-ns67p-3xczd 0 (0%) 0 (0%) 350Mi (0%) 0 (0%) 72s
kube-system calico-kube-controllers-56cdb7c587-zmz4t 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
kube-system calico-node-pshn4 250m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system coredns-6d4b75cb6d-nrbmq 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 46m
kube-system coredns-6d4b75cb6d-q9hvs 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 46m
kube-system etcd-my-cluster 100m (0%) 0 (0%) 100Mi (0%) 0 (0%) 354d
kube-system kube-apiserver-my-cluster 250m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-controller-manager-my-cluster 200m (1%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-proxy-qwmrd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system kube-scheduler-my-cluster 100m (0%) 0 (0%) 0 (0%) 0 (0%) 354d
kube-system metrics-server-5744cd7dbb-h758l 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 34m
kube-system metrics-server-6bf466fbf5-nt5k6 100m (0%) 0 (0%) 200Mi (0%) 0 (0%) 47m
kube-system node-shell-0c3bde15-32fa-4831-9f05-ebfe5d14a909 0 (0%) 0 (0%) 0 (0%) 0 (0%) 43m
kube-system node-shell-692c6032-8301-44ac-b12e-e5a222a6f80a 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8m6s
lens-metrics prometheus-0 100m (0%) 0 (0%) 512Mi (0%) 0 (0%) 14m
imaginary-dev mailhog-7f666fdfbf-xgcwf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev ms-nginx-766bf76f87-ss8h6 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev ms-tenant-f847987cc-rf9db 400m (2%) 500m (3%) 500M (0%) 700M (1%) 46m
imaginary-dev ms-webapp-5d6bcdcc4f-x68s4 100m (0%) 200m (1%) 200M (0%) 400M (0%) 46m
imaginary-dev mysql-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-dev redis-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat mailhog-685b7c6844-cpmfp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat ms-tenant-6965d68df8-nlm7p 500m (3%) 600m (3%) 512M (0%) 704M (1%) 46m
imaginary-uat ms-webapp-6cb7fb6c65-pfhsh 100m (0%) 200m (1%) 200M (0%) 400M (0%) 46m
imaginary-uat mysql-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
imaginary-uat redis-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 46m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 2800m (17%) 2500m (15%)
memory 3395905792 (5%) 2770231040 (4%)
ephemeral-storage 2Gi (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
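The output above is trimmed from `kubectl describe node`. To see which pods are actually being evicted from this node, we have been listing them roughly like this (the node name is a placeholder for our real node):

```bash
# Full node description, including conditions such as DiskPressure
kubectl describe node <node-name>

# Evicted pods end up in phase Failed; list them across all namespaces
kubectl get pods -A --field-selector=status.phase=Failed

# Recent eviction events, if they have not already aged out
kubectl get events -A --field-selector=reason=Evicted
```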
- Disk Usage Analysis: We looked at the disk usage on the node using the `du` and `df` commands, roughly as shown below.
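The containerd, kubelet, and pod-log paths below are the defaults and are our assumption about where the space is most likely going:

```bash
# Overall filesystem usage on the node
df -h

# Typical space consumers on a containerd-based node (default paths assumed)
sudo du -sh /var/lib/containerd /var/lib/kubelet /var/log/pods

# Largest directories directly under the containerd root
sudo du -h -d 1 /var/lib/containerd | sort -h | tail -n 10
```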
Despite the above efforts, we haven't been able to pinpoint the exact cause of the DiskPressure issue. We suspect it could be related to excessive logging, large container images, or inefficient resource allocation, but we are unsure how to confirm and resolve these suspicions.
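To start confirming the logging and image suspicions we tried something like the snippet below, but we are not confident we are interpreting the numbers correctly (crictl output and flags may vary by version; the log path is the kubelet default):

```bash
# Disk usage the container runtime reports for its image filesystem
sudo crictl imagefsinfo

# Image list with sizes, to spot unusually large images
sudo crictl images

# Ten largest container log files (kubelet writes them under /var/log/pods by default)
sudo find /var/log/pods -name '*.log' -printf '%s %p\n' | sort -n | tail -n 10
```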
Therefore, I kindly request assistance with the following:
- Best practices for analyzing and debugging DiskPressure issues in Kubernetes nodes.
- Tools or techniques to identify the specific processes or pods that are consuming the most disk space.
- Strategies to optimize resource allocation and disk usage within Kubernetes to mitigate DiskPressure problems.
- Any additional insights or recommendations for troubleshooting this issue effectively.
Any suggestions, recommendations, or experience-based insights would be greatly appreciated. Thank you in advance for your assistance!