
We have an application deployed on GKE with a total of 10 pods running and serving the application. I am trying to find a metric I can use to create an alert when a pod goes down. Alternatively, is there a way to check the status of the pods so that I can set up an alert based on that condition?

I explored GCP and looked into their documentation but couldn't find anything. What I did find is the metric below, but I don't know exactly what it measures. To me it looks like the number of times Kubernetes thinks a pod has died and has restarted it.

Metric: kubernetes.io/container/restart_count
Resource type: k8s_container

Any advice on this is highly appreciated, as we can improve our monitoring based on this metric.

[Screenshot: GCP alerting policy creation]

mikita agrawal

2 Answers

2

You are right, that metric is exactly that: it is the count of pod restarts.

Number of times the container has restarted. Sampled every 60 seconds. After sampling, data is not visible for up to 120 seconds.

Read more at: https://cloud.google.com/monitoring/api/metrics_kubernetes
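If you want to alert on it directly in Cloud Monitoring, the alerting-policy condition is essentially just a metric filter; a rough sketch is below (the namespace value is a placeholder to adjust, and since the metric is cumulative you would normally pair it with a delta or rate aligner, e.g. ALIGN_DELTA, over your alignment period):

metric.type = "kubernetes.io/container/restart_count" AND
resource.type = "k8s_container" AND
resource.labels.namespace_name = "my-namespace"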

Or

You can use Prometheus to get the metrics and monitor them with Grafana:

sum(kube_pod_container_status_restarts_total{cluster="$cluster",namespace="$namespace",pod=~"$service.*"})

This will give you the value of the pod restart count.
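Since that metric is a cumulative counter, it is usually more practical to alert on its increase over a window rather than on the raw total. A minimal Prometheus alerting rule sketch along those lines (the namespace, alert name, window, and threshold are placeholders, not values from the question):

groups:
  - name: pod-restarts
    rules:
      - alert: PodRestartingTooOften
        # Fires when any container in the namespace has restarted more than
        # 3 times within the last 15 minutes; tune the window and threshold.
        expr: increase(kube_pod_container_status_restarts_total{namespace="my-namespace"}[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting frequently"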

Or

You can also use BotKube: https://www.botkube.io/installation/

You can set it up to send a notification to Slack, etc. when your readiness or liveness probes fail.

Or

You can also write your own script and run it on Kubernetes to monitor and notify when any pod restarts in the cluster.

Example on GitHub: https://github.com/harsh4870/Slack-Post-On-POD-Ready-State

This script sends a Slack notification when a pod becomes ready after a deployment; you can change it to monitor the restart count instead.
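As a rough illustration of that idea (not the script from the repo), a polling loop like the one below could watch restart counts and post to a Slack incoming webhook. The webhook URL, namespace, and polling interval are placeholders, and it assumes kubectl and jq are available:

#!/usr/bin/env bash
# Sketch: poll container restart counts and post to Slack when they increase.
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
NAMESPACE="my-namespace"                                          # placeholder
declare -A last_counts

while true; do
  while read -r key restarts; do
    prev="${last_counts[$key]:-0}"
    if (( restarts > prev )); then
      # A container restarted since the last poll; send a Slack message.
      curl -s -X POST -H 'Content-Type: application/json' \
        --data "{\"text\": \"Container ${key} restarted (restartCount=${restarts})\"}" \
        "$SLACK_WEBHOOK_URL" > /dev/null
    fi
    last_counts[$key]=$restarts
  done < <(kubectl get pods -n "$NAMESPACE" -o json \
    | jq -r '.items[] | .metadata.name as $pod | .status.containerStatuses[]? | "\($pod)/\(.name) \(.restartCount)"')
  sleep 60
done

Running something like this as a small Deployment in the cluster, with a service account allowed to list pods, would give you the in-cluster notifier described above.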

I would recommend the Prometheus and Grafana option; Stackdriver is also good, but I am not a Google employee.

Harsh Manvar
  • Thanks. I tried to use this metric in GCP alerting and I am seeing what I have attached to the question. So two things here: 1) Why does the restart count always show 10 as the base? Is it because my cluster has 10 pods running? Ideally the restart count should be zero, so I am wondering what my threshold value should be. 2) Does that count signify the number of times Kubernetes has restarted the pods? – mikita agrawal Jan 11 '22 at 23:09
0

Why do you want to monitor when a pod is down? Kubernetes will immediately try to restart it on the same node, or on a different one if that node is down for whatever reason.

Instead, there are other metrics you should monitor, like restart_count, which could indicate that pods are not coming back online, but also metrics such as the following (see the query sketch after the list):

  • kube_pod_container_status_restarts_total
  • kube_pod_status_phase
  • kube_pod_container_status_running
  • kube_pod_status_phase vs kube_node_status_capacity_pods
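For example, kube_pod_status_phase (exposed by kube-state-metrics) lets you catch pods stuck in a bad phase rather than only counting restarts; a rough query to alert on (the namespace and the set of phases are placeholders to adapt):

sum by (namespace, pod) (kube_pod_status_phase{namespace="my-namespace", phase=~"Pending|Failed|Unknown"}) > 0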

This article lists a lot of interesting metrics to monitor: https://medium.com/google-cloud/gke-monitoring-84170ea44833

boredabdel
  • Whenever I try to use any of these metrics in my alerting policy I see ```No data is available for the selected timeframe``` on the graph. I tried increasing the timeframe to something longer, like 90 days, but still no data. Is there any filter or advanced setting I need to configure? – mikita agrawal Jan 11 '22 at 23:13
  • Check that the monitoring agent is running properly on your GKE cluster and that monitoring is enabled for GKE. – Harsh Manvar Jan 12 '22 at 03:30