
Recently, I've experienced quite a few false positives from an existing Prometheus-based alert that has been difficult to nail down (and that seemingly should be simple), so I thought I'd ask whether there's something obviously wrong with the query or the thought process behind it.

I have a Kafka consumer that reads from several different Kafka topics, along with an associated metric, kafka_consumergroup_lag, that reports the current lag for the consumer group. The metric exposes the following two important labels (a sample series is shown after the list):

  • consumergroup - The name of the consumer group
  • topic - The name of the topic being consumed from
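
For illustration, a single scraped sample for this metric might look like the line below (the topic name and lag value here are made up):

# hypothetical sample - the topic and value are illustrative
kafka_consumergroup_lag{consumergroup="my-consumer-group", topic="orders"} 1234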

The typical pattern of lag (via a sum(kafka_consumergroup_lag) by (consumergroup, topic)) looks something like this:

[graph of the typical lag pattern: values rise and fall back down over time]

However, if it began continually growing (without any decreases for a period of time), this could indicate a much larger problem.

How can I construct a query that detects only increases for a given consumergroup-topic combination over the course of an hour? I'd like to use this metric in conjunction with a Grafana-based alert that fires when this threshold is met (e.g. "Consumer Lag for $topic in $consumergroup has continually increased for an hour").

At present, I'm using something like the following which I thought would be sufficient, but I'm still seeing false positives being reported:

sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600

This has the following associated alerting conditions in Grafana (an equivalent native Prometheus rule is sketched after this list for reference):

  • EVALUATE every 1m for 5m (since if it's already been increasing, we'd want to alert after 5 minutes)
  • WHEN last() of query(queryName, 5m, now) is above 0
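
For reference, this is roughly how the same setup would look expressed as a native Prometheus alerting rule; the group and alert names are placeholders, and the for: 5m mirrors the Grafana evaluation window above:

groups:
  - name: kafka-consumer-lag                       # illustrative group name
    rules:
      - alert: ConsumerLagContinuallyIncreasing    # illustrative alert name
        expr: |
          sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
        for: 5m
        annotations:
          summary: 'Consumer Lag for {{ $labels.topic }} in {{ $labels.consumergroup }} has continually increased for an hour'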

This seems like a fairly easy thing to detect, but it has been quite a challenge to get it correct.


1 Answer


A query like this will return consumergroup-topic pairs whose lag has not decreased even once during the last hour:

sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
and sum by (consumergroup, topic) (resets(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) == 0

Here the resets function is used. It's designed for counters, but since your gauge behaves somewhat like a counter, it can be applied here too: resets counts how many times the value decreased between consecutive samples in the window, so a series that went, say, 10 → 40 → 30 → 45 within the hour would report one reset and be filtered out by the == 0 condition.

  • Hi @markalex! Thanks for the response, I tried using this against some known historical data and seemed to get some odd results with larger gaps. – Rion Williams Aug 18 '23 at 00:38
  • One idea that I considered was to do a sum of the current value minus the same value offset a minute prior. This would cause positive values to indicate lag growth and negative values to indicate lag decreasing. A positive value for some duration would be the alerting threshold. Does that seem reasonable? – Rion Williams Aug 18 '23 at 00:39
  • Something like `sum(metric_name{ … } - metric_name{ … } offset 1m) by (consumergroup, topic)` – Rion Williams Aug 18 '23 at 00:40
  • @RionWilliams, oh, you are right. This query will produce strange results because of `increase`. It is designed to work with counters, not gauges. Use `delta` instead (it will produce a result similar to `metric_name{ … } - metric_name{ … } offset 1h`). Otherwise this query should work. – markalex Aug 18 '23 at 05:01
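
Putting markalex's correction from the comments together with the answer above, a sketch of the combined query (swapping increase for delta, everything else unchanged) would be:

sum(delta(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
and sum by (consumergroup, topic) (resets(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) == 0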