The reason you want the rate() function in the Prometheus query is so you can see what the average rate was in that time window ([10s] in that doc example).
If you instead use the overall sum/count, both numbers keep growing, so the result won't reflect the latest time frame; it will be the average of all timings since the service started.
Example:
Imagine you have a timing that takes 1 second each time it is called and it is called about 30 times each minute:
Elapsed             Count       Sum         sum/count   sum/count (with increase)
First minute:       30          30          1           1
After 10 hours:     18,000      18,000      1           1
After 1000 hours:   1,800,000   1,800,000   1           1
So far the two look identical. Now assume that for the last 1 minute all the requests take 10 seconds, which is 10 times as slow. You would want to know about that last minute:
Elapsed             Count       Sum         sum/count   sum/count (with increase)
First minute:       30          300         10          10
After 10 hours:     18,000      18,270      1.015       10
After 1000 hours:   1,800,000   1,800,270   1.00015     10
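The numbers in the table can be reproduced with a small simulation. This is only a sketch: the function names (simulate, lifetime_average, windowed_average) are made up for illustration, the counters are sampled exactly once per minute, and the window is a plain difference between two samples, which is simpler than what Prometheus actually does.

```python
# Hypothetical simulation of two cumulative counters (call count and
# summed duration), sampled once per minute. Names are illustrative only.
CALLS_PER_MINUTE = 30

def simulate(total_minutes, slow_last_minute=True):
    """Return one cumulative (count, sum) sample per minute, starting at zero."""
    samples = [(0, 0.0)]
    for minute in range(1, total_minutes + 1):
        is_last = minute == total_minutes
        # Every call takes 1 second, except in the final minute (10 seconds).
        duration = 10.0 if (slow_last_minute and is_last) else 1.0
        prev_count, prev_sum = samples[-1]
        samples.append((prev_count + CALLS_PER_MINUTE,
                        prev_sum + CALLS_PER_MINUTE * duration))
    return samples

def lifetime_average(samples):
    """sum/count over everything since the service started."""
    count, total = samples[-1]
    return total / count

def windowed_average(samples, window_minutes=1):
    """sum/count using only the growth inside the last window."""
    count_now, sum_now = samples[-1]
    count_then, sum_then = samples[-1 - window_minutes]
    return (sum_now - sum_then) / (count_now - count_then)

for hours in (10, 1000):
    s = simulate(hours * 60)
    print(hours, "hours:", lifetime_average(s), "vs", windowed_average(s))
```

The lifetime average barely moves off 1 second, while the windowed average reports the full 10 seconds for the slow minute.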
The rate() (or increase()) function ensures that only the change within that window is used for the calculation. The longer the metric has been running, the more the large accumulated totals mask any recent volatility.
Note: In my example I used the increase() function since it is a little easier to reason through. It just reports how much the counter or sum has increased in that window. rate() is similar, but normalizes it to a per-second rate.
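To see why either function works for the average, here is a rough sketch with simplified stand-ins for increase() and rate(). These are not the Prometheus implementations (the real ones also handle counter resets and extrapolate to the window boundaries); they only show that the per-second normalization cancels out when you divide sum by count. The sample values are the 10-hour row from the example above, with an assumed 60-second window.

```python
def increase(then, now):
    """Simplified stand-in: growth of a counter across the window."""
    return now - then

def rate(then, now, window_seconds):
    """Simplified stand-in: the same growth, expressed per second."""
    return increase(then, now) / window_seconds

WINDOW_SECONDS = 60  # assumed window length for this illustration

# Counter values at the start and end of the last (slow) minute.
count_then, count_now = 17_970, 18_000
sum_then, sum_now = 17_970.0, 18_270.0

avg_from_increase = increase(sum_then, sum_now) / increase(count_then, count_now)
avg_from_rate = (rate(sum_then, sum_now, WINDOW_SECONDS)
                 / rate(count_then, count_now, WINDOW_SECONDS))

print(avg_from_increase, avg_from_rate)  # prints: 10.0 10.0
```

Both ratios give the same 10-second average, since dividing the numerator and denominator by the same window length changes nothing.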