I have a Prometheus counter (`spring_batch_job_seconds_count{status=~'FAILED'}`) that counts job failures. I want to graph job failures over time and alert on them. The `increase` function gives me what I want except for the first occurrence: the counter is not published until a failure occurs, so there is no previous value of 0 to compare the first non-zero value against, and `increase` (like `delta` and `rate`) shows nothing for the first failure. How can I create a graph that shows the first failure as well as subsequent ones, and a corresponding alert that triggers on the first failure as well as on future ones? I might settle for two alerts, one that triggers when the counter increments and one that triggers on the first occurrence, but I would not want to have to manually turn off the first-occurrence alert after it has fired.
Can you change the instrumentation code? If yes, just initialize the metric with an increment of 0. – trallnag Sep 09 '20 at 09:58
1 Answer
I managed to do this with Falco metrics.
I wanted to alert on any change, including the first time a metric appears:
(sum(falco_events{k8s_pod_name="runner"} or falco_events{} * 0) by (k8s_pod_name, rule) - sum(falco_events{k8s_pod_name="runner"} offset 5m or falco_events{} * 0) by (k8s_pod_name, rule))
Workaround from here: https://github.com/prometheus/prometheus/issues/1673
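Applied to the counter from the question, the same idea might look roughly like the sketch below. It is untested against a real Spring Batch setup and assumes the 5m window from the expression above is acceptable; no label names beyond `status` are assumed, so the subtraction is done per series rather than with `sum ... by (...)`.

```
# Current failure count minus the count 5 minutes ago.
# If the series did not exist 5 minutes ago, the `or` branch
# substitutes 0 for it (current value * 0), so the first
# failure shows up as a jump from 0.
spring_batch_job_seconds_count{status="FAILED"}
-
(
    spring_batch_job_seconds_count{status="FAILED"} offset 5m
  or
    spring_batch_job_seconds_count{status="FAILED"} * 0
)
```

For an alert, appending `> 0` to the whole expression should keep only series whose count went up within the window. Unlike `increase`, a plain subtraction does not compensate for counter resets, so the result can go negative after a restart; the `> 0` filter hides those values for alerting, but they would still appear on a graph.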

Kim
Thanks, I messed around with `or vector(0)` for ages but this works much nicer! – cfstras Mar 21 '23 at 15:33