
Background:

I have a Spring Boot Kafka consumer, and I am trying to monitor it using Prometheus and Grafana. For that, I am using Spring's built-in MeterRegistry (Micrometer). The metric I am using to count the total events consumed is kafka_consumer_fetch_manager_records_consumed_total. The idea is to use a query like `sum(increase(kafka_consumer_fetch_manager_records_consumed_total[$__range])) by (topic)`. The metrics are stored in VictoriaMetrics and queried from Grafana.

Question:

In this process, I have noticed a strange thing. After I restarted the consumer, the metric was reset and went up to 192. In this case, if I apply the increase() function, I expect the final output to also be 192 (since the total number of events consumed from the start is 192). However, with increase() I get 152. I don't really understand why that is; can someone please help?

Here are the screenshots from Grafana:

The raw values of kafka_consumer_fetch_manager_records_consumed_total


And the values of kafka_consumer_fetch_manager_records_consumed_total with increase()


PS: While writing this question, I noticed that when the consumer restarted, the metric didn't reset to 0 but started from 40. Could this be the issue? If yes, how can I solve it?

FYI, this is how I am registering the Kafka metrics from Spring Boot.

DefaultKafkaConsumerFactory<String, Map<String, String>> defaultKafkaConsumerFactory =
        new DefaultKafkaConsumerFactory<>(config, new StringDeserializer(),
                new ErrorHandlingDeserializer<>(new JsonDeserializer<>(Map.class)));

// meterRegistry is the auto-configured MeterRegistry bean, injected via @Autowired
defaultKafkaConsumerFactory.addListener(new MicrometerConsumerListener<>(meterRegistry));
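
For context, the full wiring looks roughly like this (the class and bean names, broker address, and group id below are placeholders, not my actual values):

import java.util.HashMap;
import java.util.Map;

import io.micrometer.core.instrument.MeterRegistry;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.core.MicrometerConsumerListener;
import org.springframework.kafka.support.serializer.ErrorHandlingDeserializer;
import org.springframework.kafka.support.serializer.JsonDeserializer;

@Configuration
public class KafkaConsumerMetricsConfig {

    @Bean
    public ConsumerFactory<String, Map<String, String>> consumerFactory(MeterRegistry meterRegistry) {
        // Placeholder consumer config; the real config map has more entries.
        Map<String, Object> config = new HashMap<>();
        config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ConsumerConfig.GROUP_ID_CONFIG, "my-consumer-group");

        DefaultKafkaConsumerFactory<String, Map<String, String>> factory =
                new DefaultKafkaConsumerFactory<>(config, new StringDeserializer(),
                        new ErrorHandlingDeserializer<>(new JsonDeserializer<>(Map.class)));

        // Registers the kafka_consumer_* metrics (including
        // kafka_consumer_fetch_manager_records_consumed_total) with Micrometer.
        factory.addListener(new MicrometerConsumerListener<>(meterRegistry));
        return factory;
    }
}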

I tried the PromQL query `increase(sum(kafka_consumer_fetch_manager_records_consumed_total) by (namespace) [$__range])` and, surprisingly, it produced the correct result. But as per https://www.robustperception.io/rate-then-sum-never-sum-then-rate/, it's not an ideal query to use.

1 Answer


Technically, your counter wasn't reset; it simply started with the value 40. As seen on the first screenshot, the previous run of the counter is treated as a separate time series: the two series have different labels.

Since the counter wasn't reset but simply started from the value 40, the behavior of increase() is correct: over the new series it sees growth from 40 to 192, i.e. 192 - 40 = 152.

increase() applied after sum() considers the two counters as the same time series, and even though there is no actual point with value 0, it assumes one existed (because Prometheus expects counters to reset to 0) and returns 192, as you expected.

The linked material is not really related to your situation: it talks about aggregation of multiple series, while you use sum() to merge two technically different series into one. If it is guaranteed that the selector inside sum() produces no more than one series at a time, increase() after sum() is acceptable.

markalex
  • Thanks, I understand the issue. Any suggestion on how to solve it? I think one solution is to lower the scraping interval, but I'm not sure whether even that guarantees an initial 0 value at restart (our Kafka traffic is very high and consumers start consuming data as soon as they're active). And I don't want to lower the scraping interval too much. Also, these values are just test values; in the final version we'll have multiple consumer pods running, so the selector inside sum will produce multiple series, and any one of them can reset at any time. – Meet Rathod Jul 03 '23 at 04:03
    @MeetRathod, the correct solution would be to make Prometheus treat the counters after a restart as the same metric (for the same pod, topic, service, and so on). It's not visible from your screenshot which label causes the problem, but you could try something like `sum without(problem) (kafka_consumer_fetch_manager_records_consumed_total)`, where `problem` is the name of the label causing this behaviour. – markalex Jul 03 '23 at 09:15