I have a Prometheus counter (spring_batch_job_seconds_count{status=~'FAILED'}) that counts job failures. I want to graph job failures over time and alert on them. The increase function gives me what I want except for the first occurrence. The counter is not published until a failure occurs, so the first failure produces no increase (or delta or rate): there is no previous counter value of 0 to compare the first non-zero value against.

How can I create a graph that shows the first failure as well as subsequent failures, and a corresponding alert that triggers on the first failure as well as future ones? I might be willing to settle for two alerts, one that triggers when the counter increments and one that triggers on the first occurrence, but I would not want to have to manually shut off the first-occurrence alert after it fires for the first time.

David Lewine
  • Can you change the instrumentation code? If yes, just initialize the metric with an increment of 0. – trallnag Sep 09 '20 at 09:58

1 Answer

I managed to do this with Falco metrics.

I want to alert on any change, even the first time a metric appears.

(sum(falco_events{k8s_pod_name="runner"} or falco_events{} * 0) by (k8s_pod_name, rule) - sum(falco_events{k8s_pod_name="runner"} offset 5m or falco_events{} * 0) by (k8s_pod_name, rule))

Workaround from here: https://github.com/prometheus/prometheus/issues/1673
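Applied to the counter from the question, the same trick might look like the rule below. This is only a sketch under assumptions: the group name, alert name, severity label, and the 5m look-back window are hypothetical choices of mine, and the selector uses status="FAILED" rather than the regex match. The idea is that `or ... * 0` substitutes 0 for any series that did not exist 5 minutes ago, so the first sample of a new series counts as an increase.

```yaml
groups:
  - name: batch-job-failures          # hypothetical group name
    rules:
      - alert: BatchJobFailed         # hypothetical alert name
        # Current value minus the value 5m ago; series that did not
        # exist 5m ago are treated as 0 via the "or ... * 0" fallback.
        expr: |
          spring_batch_job_seconds_count{status="FAILED"}
          -
          (
            spring_batch_job_seconds_count{status="FAILED"} offset 5m
            or
            spring_batch_job_seconds_count{status="FAILED"} * 0
          )
          > 0
        labels:
          severity: warning           # assumed label, adjust to your routing
        annotations:
          summary: "A Spring Batch job failed within the last 5 minutes"
```

Dropping the trailing `> 0` gives an expression that can be graphed. Unlike increase(), a plain subtraction does not compensate for counter resets, so a restart shows up as a negative value on the graph (and is filtered out of the alert by `> 0`). The alert stays active for roughly the length of the offset window after each failure and then resolves on its own.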

Kim