Recently, I've been seeing quite a few false positives from an existing Prometheus-based alert that has been difficult to nail down (even though it seemingly should be simple), so I thought I'd ask whether there's something obviously wrong with the query or the thought process behind it.
I have a Kafka consumer that handles reading from several different Kafka topics, and an associated metric, kafka_consumergroup_lag, that stores the expected lag for the consumer group itself. The metric exposes the following two important labels:
- consumergroup - The name of the consumer group
- topic - The name of the topic being consumed from
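For reference, a scrape of this metric looks roughly like the following (the topic names and values here are purely illustrative):

    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="orders"} 42
    kafka_consumergroup_lag{consumergroup="my-consumer-group",topic="payments"} 7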
The typical pattern of lag (via sum(kafka_consumergroup_lag) by (consumergroup, topic)) looks something like this:
However, if the lag began continually growing (without any decreases for a period of time), this could indicate a much larger problem.
How can I construct a query that only detects increases for a given consumergroup-topic combination over the period of an hour? I'd like to use this metric in conjunction with a Grafana-based alert that fires when this threshold is met (e.g. "Consumer Lag for $topic in $consumergroup has continually increased for an hour").
At present, I'm using something like the following, which I thought would be sufficient, but I'm still seeing false positives being reported:
sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
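For additional context, this is roughly what that expression looks like when packaged as a standalone Prometheus alerting rule; the group name, alert name, and label matcher are just illustrative, and in practice I'm evaluating the query through Grafana as described below:

    groups:
      - name: kafka-consumer-lag                        # illustrative group name
        rules:
          - alert: ConsumerLagContinuallyIncreasing     # illustrative alert name
            # Same expression as above: total growth in lag over the last hour
            expr: |
              sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h]))
                by (consumergroup, topic) >= 3600
            # Only fire once the condition has held for 5 minutes
            for: 5m
            annotations:
              summary: "Consumer Lag for {{ $labels.topic }} in {{ $labels.consumergroup }} has continually increased for an hour"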
This has an associated alerting condition of:
- EVALUATE every 1m for 5m (since if it's already been increasing, we'd want to alert after 5 minutes)
- WHEN last() of query(queryName, 5m, now) is above 0
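In Grafana's classic dashboard alerting, that condition serializes into the panel JSON roughly like the sketch below; I'm assuming the panel query's refId is A (queryName above), the name is just a placeholder, and "frequency"/"for" correspond to the EVALUATE settings while the reducer/evaluator pair corresponds to WHEN last() IS ABOVE 0:

    "alert": {
      "name": "Consumer lag continually increasing",
      "frequency": "1m",
      "for": "5m",
      "conditions": [
        {
          "type": "query",
          "query": { "params": ["A", "5m", "now"] },
          "reducer": { "type": "last", "params": [] },
          "evaluator": { "type": "gt", "params": [0] },
          "operator": { "type": "and" }
        }
      ]
    }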
This seems like a fairly easy thing to detect, but it has been quite a challenge to get it correct.