I'm using Kafka exporter to expose Kafka metrics, which I then query in Prometheus. I have a Kafka topic with 3 consumer groups, and these 3 consumer groups are used by 3 different services. I am trying to write a query that raises an alert when the lag of any of these consumer groups rises above the average lag.

The query I have so far:

kafka_consumer_group_lag{group_id=~"consumer_group.*"} > avg_over_time(kafka_consumer_group_lag{group_id=~"consumer_group.*"}[5m]) 

But this doesn't seem to work. I'm not sure how to form the calculation. Can someone help me understand how to write this query? The full group_id is not known in advance; it always starts with consumer_group, which is why I'm using the wildcard.

perplexedDev
  • Your query should work. It should return the time series whose current value is greater than the average value of the same series over the last 5 minutes. Check the query and both of its parts on Prometheus' `/graph` page (switch to the Graph tab) to see whether they return the expected result, or any result at all. – markalex Apr 14 '23 at 07:42
  • Also note that this query should produce quite a lot of false alarms when there is a sudden, short increase in the message production rate. – markalex Apr 14 '23 at 07:44
  • "But this doesn't seem to work" - what do you mean by this? Can you explain which part is not working? – Isaiah4110 Apr 14 '23 at 13:22
  • @markalex is there a way to avoid the false alarms you mentioned? – perplexedDev Apr 14 '23 at 16:48
  • With this exact approach, no. If a producer puts 100000 messages into the queue at once, no softening of the rule will help. You'll need to base your alert on some other indicator, for example a combination of the number of processed items and the queue size, or something like "queue has not shrunk for the last 5 min", depending on your situation. – markalex Apr 14 '23 at 16:55
  • Have you managed to find out why your query initially "doesn't seem to work"? – markalex Apr 14 '23 at 16:56
  • Not yet, I get empty results for the query. I tried the two parts individually as well and get an empty result for the avg_over_time query as well as for kafka_consumer_group_lag. I might be missing something here. – perplexedDev Apr 14 '23 at 17:12
  • And does it work without the regex selector? – markalex Apr 14 '23 at 17:18
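
To make the difference between the two alert ideas discussed in the comments concrete, here is a minimal Python sketch. It is not PromQL and does not talk to Prometheus; it only models the comparisons on one series at a time. The sample values and the function names `above_window_average` and `never_decreased` are hypothetical, and "queue has not shrunk" is interpreted here, as an assumption, as "non-zero lag with no decrease between consecutive samples in the window".

```python
from statistics import mean


def above_window_average(samples):
    """Model `lag > avg_over_time(lag[5m])` for a single series.

    `samples` holds the lag values inside the window, oldest first.
    The newest sample is part of the window, so it also pulls the average up.
    """
    return samples[-1] > mean(samples)


def never_decreased(samples):
    """One possible reading of "queue has not shrunk for the last 5 min".

    True when lag is non-zero and no sample is lower than the one before it.
    """
    no_drop = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return samples[-1] > 0 and no_drop


# Hypothetical lag samples for one consumer group over a 5-minute window.
burst = [10, 12, 11, 9, 100000]                   # producer just pushed a large batch
draining = [100000, 80000, 60000, 40000, 20000]   # consumer is catching up
stuck = [500, 500, 520, 540, 560]                 # consumer is not keeping up

for name, window in [("burst", burst), ("draining", draining), ("stuck", stuck)]:
    print(name, above_window_average(window), never_decreased(window))
```

In this sketch the average-based rule fires for the `burst` window even though the consumer was healthy just before it, which is the false-alarm case mentioned above, while the no-decrease rule fires only for the `stuck` window. Whether that reading of "has not shrunk" suits a given workload depends on the scrape interval and on how noisy the lag is.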

0 Answers