I have a GKE cluster with some workloads that can have boot issues. Is it possible to create a Stackdriver notification when a workload runs into an issue?

For example: create an incident when CrashLoopBackOff is triggered, pods are unschedulable, or the Workload Status is anything other than OK for 5 minutes.

Laures

1 Answer

You can use log-based metrics to track all the CrashLoopBackOff states in your pods, using the following advanced query:

https://cloud.google.com/logging/docs/view/advanced-queries

resource.type="k8s_pod"
resource.labels.location="us-central1-a"
resource.labels.cluster_name="standard-cluster-1"
"myproject"
jsonPayload.message="Back-off restarting failed container"
resource.labels.pod_name:"myproject"

Unschedulable pods might go into CrashLoopBackOff or never be deployed at all, and the latter is only traceable at the API server.

Note that when you build the log-based metric, you have to adapt the labels to your monitoring version (legacy or non-legacy). This example uses "non-legacy" monitoring and metrics.

Create the metric via log-based metrics and you'll find it in Monitoring as logging/user/xxxx

https://cloud.google.com/logging/docs/logs-based-metrics/
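As a rough sketch of the metric step, the filter above can be assembled from parameters and handed to `gcloud logging metrics create`. The function names, the metric name `crashloop_count`, and the description text are made up for illustration, and the bare `"myproject"` text term from the query is left out because the `pod_name` substring match already covers it. The command is only built here, not executed.

```python
def quote(value: str) -> str:
    """Quote a value for a Logging filter, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def crashloop_filter(location: str, cluster: str, pod_substring: str) -> str:
    """Build the back-off filter; one comparison per line, implicitly ANDed."""
    clauses = [
        'resource.type="k8s_pod"',
        f"resource.labels.location={quote(location)}",
        f"resource.labels.cluster_name={quote(cluster)}",
        'jsonPayload.message="Back-off restarting failed container"',
        # ':' is the "has" (substring) operator, as in the original query.
        f"resource.labels.pod_name:{quote(pod_substring)}",
    ]
    return "\n".join(clauses)


def metric_create_command(metric_name: str, log_filter: str) -> list[str]:
    """Return the argv for creating the log-based metric (hypothetical names)."""
    return [
        "gcloud", "logging", "metrics", "create", metric_name,
        "--description=Containers stuck in back-off restarts",
        f"--log-filter={log_filter}",
    ]


log_filter = crashloop_filter("us-central1-a", "standard-cluster-1", "myproject")
command = metric_create_command("crashloop_count", log_filter)
```

Passing `command` to `subprocess.run` would create the metric, assuming `gcloud` is installed and authenticated against the right project.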

Once the metric is created, you can create an alerting policy on it to notify you when the issue occurs.
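For illustration, an alerting policy on such a metric might look like the dictionary below, which follows the field names of the Cloud Monitoring `AlertPolicy` resource. The metric name `crashloop_count`, the display names, and the choice to apply the 5-minute window from the question as the alignment period are all assumptions; the threshold and window will likely need tuning for a real cluster.

```python
import json


def crashloop_alert_policy(metric_name: str, window_seconds: int = 300) -> dict:
    """Sketch of an AlertPolicy body that fires when the metric is non-zero."""
    metric_filter = (
        f'metric.type="logging.googleapis.com/user/{metric_name}" '
        'AND resource.type="k8s_pod"'
    )
    return {
        "displayName": f"GKE workload not OK ({metric_name})",  # hypothetical
        "combiner": "OR",
        "conditions": [
            {
                "displayName": "Back-off restarts observed",  # hypothetical
                "conditionThreshold": {
                    "filter": metric_filter,
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 0,
                    # Fire as soon as one aligned window is above the threshold.
                    "duration": "0s",
                    "aggregations": [
                        {
                            "alignmentPeriod": f"{window_seconds}s",
                            "perSeriesAligner": "ALIGN_RATE",
                        }
                    ],
                },
            }
        ],
    }


policy = crashloop_alert_policy("crashloop_count")
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON can serve as a starting point for a policy created in the console or through the Monitoring API; notification channels are left out here and would be attached separately.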

Wojtek_B
  • great advice. since the back-off message has no reference to the deployment i can only create a "something's wrong" alert, but that's ok. unschedulable pods generate another log message with `jsonPayload.reason="FailedScheduling"` so this is also possible. – Laures Dec 05 '19 at 17:01
  • You can combine the two and create an alert when either of them occurs, and then you will have a "not ok" alert in place. If you find my reply useful please accept it :) – Wojtek_B Dec 06 '19 at 09:21
  • @Laures how do you isolate the `FailedScheduling` ones that aren't ephemeral? We often see those around rollouts and wouldn't like to be alerted for those – Angad Feb 14 '22 at 13:47
  • i don't. releases are part of my responsibility so i can just ignore notifications for 10 minutes or notify my team that a rollout is going to happen – Laures Feb 15 '22 at 14:21
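The combination suggested in the comments can be sketched as a single filter that matches either symptom. The function name and the cluster parameter are illustrative; the two comparisons come from the answer and the first comment. It does not address the rollout noise raised above, since a log filter alone cannot tell transient `FailedScheduling` events from persistent ones.

```python
def not_ok_filter(cluster: str) -> str:
    """Build one Logging filter matching back-off restarts or failed scheduling."""
    symptoms = [
        'jsonPayload.message="Back-off restarting failed container"',
        'jsonPayload.reason="FailedScheduling"',
    ]
    # Parenthesize the OR group so it is evaluated as one unit under the ANDs.
    either_symptom = "(" + " OR ".join(symptoms) + ")"
    return (
        'resource.type="k8s_pod" AND '
        f'resource.labels.cluster_name="{cluster}" AND '
        + either_symptom
    )


combined = not_ok_filter("standard-cluster-1")
```

A log-based metric built on this filter gives one "not ok" signal for the whole cluster, at the cost of not saying which of the two conditions occurred.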