
Recently, I've experienced quite a few false positives from an existing Prometheus-based alert that has been difficult to nail down (and that seemingly should be simple), so I thought I'd ask whether there's something obviously wrong with the query or the thought process behind it.

I have a Kafka consumer that reads from several different Kafka topics, along with an associated metric, kafka_consumergroup_lag, that reports the current lag for the consumer group. The metric exposes the following two important labels (a sample series is shown after the list):

  • consumergroup - The name of the consumer group
  • topic - The name of the topic being consumed from
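
For illustration, a single scraped sample for this metric might look like the line below (the topic name and lag value here are made up):

# hypothetical sample - the topic and value are illustrative
kafka_consumergroup_lag{consumergroup="my-consumer-group", topic="orders"} 1234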

The typical pattern of lag (via a sum(kafka_consumergroup_lag) by (consumergroup, topic)) looks something like this:

[graph of the typical lag pattern: values rise and fall back down over time]

However, if it began continually growing (without any decreases for a period of time), this could indicate a much larger problem.

How can I construct a query that detects only increases for a given consumergroup-topic combination over the course of an hour? I'd like to use this metric in conjunction with a Grafana-based alert that fires when this threshold is met (e.g. "Consumer Lag for $topic in $consumergroup has continually increased for an hour").

At present, I'm using something like the following which I thought would be sufficient, but I'm still seeing false positives being reported:

sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600

This has the following associated alerting conditions in Grafana (an equivalent native Prometheus rule is sketched after this list for reference):

  • EVALUATE every 1m for 5m (since if it's already been increasing, we'd want to alert after 5 minutes)
  • WHEN last() of query(queryName, 5m, now) is above 0
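
For reference, this is roughly how the same setup would look expressed as a native Prometheus alerting rule; the group and alert names are placeholders, and the for: 5m mirrors the Grafana evaluation window above:

groups:
  - name: kafka-consumer-lag                       # illustrative group name
    rules:
      - alert: ConsumerLagContinuallyIncreasing    # illustrative alert name
        expr: |
          sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
        for: 5m
        annotations:
          summary: 'Consumer Lag for {{ $labels.topic }} in {{ $labels.consumergroup }} has continually increased for an hour'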

This seems like a fairly easy thing to detect, but it has been quite a challenge to get it correct.


1 Answer


A query like this will return consumergroup-topic pairs whose lag has not decreased even once during the last hour:

sum(increase(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
and sum by (consumergroup, topic) (resets(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) == 0

Here the resets function is used. It's designed for counters, but since your gauge behaves somewhat like a counter, it can be applied here too: resets counts how many times the value decreased between consecutive samples in the window, so a series that went, say, 10 → 40 → 30 → 45 within the hour would report one reset and be filtered out by the == 0 condition.

  • Hi @markalex! Thanks for the response, I tried using this against some known historical data and seemed to get some odd results with larger gaps. – Rion Williams Aug 18 '23 at 00:38
  • One idea that I considered was to do a sum of the current value minus the same value offset a minute prior. This would cause positive values to indicate lag growth and negative values to indicate lag decreasing. A positive value for some duration would be the alerting threshold. Does that seem reasonable? – Rion Williams Aug 18 '23 at 00:39
  • Something like `sum(metric_name{ … } - metric_name{ … } offset 1m) by (consumergroup, topic)` – Rion Williams Aug 18 '23 at 00:40
  • @RionWilliams, oh, you are right. This query will produce strange results because of `increase`. It is designed to work with counters, not gauges. Use `delta` instead (it will produce a result similar to `metric_name{ … } - metric_name{ … } offset 1h`). Otherwise this query should work. – markalex Aug 18 '23 at 05:01
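
Putting markalex's correction from the comments together with the answer above, a sketch of the combined query (swapping increase for delta, everything else unchanged) would be:

sum(delta(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) by (consumergroup, topic) >= 3600
and sum by (consumergroup, topic) (resets(kafka_consumergroup_lag{consumergroup="my-consumer-group"}[1h])) == 0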