
The linked documentation states why rate must be used for Micrometer metrics:

Representing a counter without rate normalization over some time window is rarely useful, as the representation is a function of both the rapidity with which the counter is incremented and the longevity of the service.

I still don't understand why I can't just compute sum/count.

Any input is helpful.

Ivan Aracki
Mandroid

1 Answer


The reason you want the rate() function in the Prometheus query is so you can see what the average rate was in that time window ([10s] in that doc example).

If you instead use the overall sum/count, both values keep growing for as long as the service runs, so the result is not the average for the latest time frame but the average of all timings since the service started.

Example:

Imagine you have a timer for an operation that takes 1 second each time it is called, and it is called about 30 times each minute:

                   Count        Sum (s)    sum/count   sum/count (with increase)
First minute:      30           30         1           1
After 10 hours:    18,000       18,000     1           1
After 1000 hours:  1,800,000    1,800,000  1           1

So far the two look identical. Now assume that during the most recent minute every request takes 10 seconds, which is 10 times slower than normal. You would want to know about that last minute. Each row below shows the values when the slow minute is the latest one:

                   Count        Sum (s)    sum/count   sum/count (with increase)
First minute:      30           300        10          10
After 10 hours:    18,000       18,270     1.015       10
After 1000 hours:  1,800,000    1,800,270  1.00015     10

The rate (or increase) function ensures that only the change within that window is used for the calculation. Without it, the longer the metric has been running, the more the large accumulated totals mask any recent volatility.
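To make the tables above concrete, here is a minimal Python sketch that simulates the two cumulative counters (count and sum) and compares the lifetime average with the average over the last minute. The function names (`simulate`, `lifetime_average`, `windowed_average`) are made up for this illustration and are not part of Micrometer or Prometheus; it also assumes one sample per minute, which is a simplification.

```python
# Simulate a timer that publishes two cumulative counters: count and sum.
# Every minute has 30 calls of 1 second each, except the final minute,
# where each call takes 10 seconds. All names here are hypothetical.

CALLS_PER_MINUTE = 30


def simulate(total_minutes, normal_secs=1.0, slow_secs=10.0):
    """Return one (count, sum) sample per minute, starting from (0, 0)."""
    samples = [(0, 0.0)]  # both counters are zero when the service starts
    for minute in range(1, total_minutes + 1):
        duration = slow_secs if minute == total_minutes else normal_secs
        prev_count, prev_sum = samples[-1]
        samples.append((prev_count + CALLS_PER_MINUTE,
                        prev_sum + CALLS_PER_MINUTE * duration))
    return samples


def lifetime_average(samples):
    """Plain sum/count over everything since the service started."""
    count, total = samples[-1]
    return total / count


def windowed_average(samples, window_minutes=1):
    """sum/count using only the increase of each counter within the window."""
    (count_start, sum_start) = samples[-1 - window_minutes]
    (count_end, sum_end) = samples[-1]
    return (sum_end - sum_start) / (count_end - count_start)


if __name__ == "__main__":
    for label, minutes in [("First minute", 1),
                           ("After 10 hours", 600),
                           ("After 1000 hours", 60_000)]:
        samples = simulate(minutes)
        print(f"{label:18} lifetime={lifetime_average(samples):.5f} "
              f"last minute={windowed_average(samples):.1f}")
```

The lifetime value drifts toward 1 as the uptime grows, while the windowed value stays at 10 regardless of how long the service has been running.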

Note: In my example I used the increase function since it is a little easier to reason through. It just reports how much the count or sum has increased in that window. rate is similar, but normalizes the result to a per-second rate.
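As a rough illustration of why increase and rate give the same average, the sketch below divides both deltas by the same window length, which cancels out in the ratio. The `increase` and `rate` helpers here are simplified stand-ins written for this example; the real Prometheus functions also extrapolate to the window boundaries and handle counter resets, which this ignores. In PromQL the equivalent query would look something like `rate(my_timer_seconds_sum[1m]) / rate(my_timer_seconds_count[1m])`, where the metric names are hypothetical.

```python
# Simplified stand-ins for Prometheus' increase() and rate(); the real
# functions also extrapolate and handle counter resets (not modeled here).

def increase(first, last):
    """How much a cumulative counter grew between two samples."""
    return last - first


def rate(first, last, window_seconds):
    """The same growth, normalized to a per-second value."""
    return increase(first, last) / window_seconds


WINDOW_SECONDS = 60

# Counter values at the start and end of the slow minute after 10 hours,
# taken from the second table in the answer.
count_start, count_end = 17_970, 18_000
sum_start, sum_end = 17_970.0, 18_270.0

avg_from_increase = (increase(sum_start, sum_end)
                     / increase(count_start, count_end))
avg_from_rate = (rate(sum_start, sum_end, WINDOW_SECONDS)
                 / rate(count_start, count_end, WINDOW_SECONDS))

print(avg_from_increase, avg_from_rate)
```

Both expressions yield the average duration inside the window, because the window length appears in the numerator and the denominator of the rate-based ratio.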

Eugene
checketts