There are times when you need to divide one metric by another metric.
For example, I'd like to calculate a mean latency like this:
rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])
If there is no activity during the specified time period, the rate() in the divisor becomes 0 and the result of the division becomes NaN.
If I do some aggregation over the results (avg() or sum() or whatever), the whole aggregation result becomes NaN.
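To illustrate the propagation, here is a minimal Python sketch. It is only a model of the float arithmetic, not Prometheus code: the `safe_div` helper and the sample values are made up for illustration, and the helper assumes IEEE 754 behavior where 0/0 yields NaN.

```python
import math

def safe_div(num, den):
    # Hypothetical helper: mimic IEEE 754 float division, where 0/0 is NaN
    # and x/0 is +/-Inf, instead of Python's ZeroDivisionError.
    if den == 0:
        if num == 0 or math.isnan(num):
            return math.nan
        return math.copysign(math.inf, num)
    return num / den

# Made-up per-series rates as (sum rate, count rate); the second series is idle.
series = [(0.5, 2.0), (0.0, 0.0), (1.5, 2.0)]

latencies = [safe_div(s, c) for s, c in series]

# A single NaN operand makes the whole sum, and therefore the average, NaN.
average = sum(latencies) / len(latencies)

print(latencies)
print(average)
```

One idle series is enough to poison the aggregate, which matches what I see on the graph.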
So I add a check for zero in the divisor:
rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)
This removes NaNs from the result vector. It also tears the line on the graph to shreds.
Let's mark periods of inactivity with a 0 value to make the graph continuous again:
rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)
or
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > bool 0
This effectively replaces NaNs with 0, the graph is continuous, and aggregations work OK.
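As I understand the semantics, the filter and the fallback combine as in the following Python sketch. It models instant vectors as dictionaries keyed by a label value; the names `sum_rate`, `count_rate`, and the `cmd_*` keys are hypothetical, and this is a simplified model rather than how Prometheus evaluates the query.

```python
# Made-up instant vectors keyed by label set: sum rate and count rate.
sum_rate = {"cmd_a": 0.5, "cmd_b": 0.0, "cmd_c": 1.5}
count_rate = {"cmd_a": 2.0, "cmd_b": 0.0, "cmd_c": 2.0}

# count > 0: a filtering comparison keeps only the series whose value passes.
active = {k: v for k, v in count_rate.items() if v > 0}

# sum / (count > 0): division yields output only for label sets present on both sides,
# so the idle series simply disappears (this is the gap in the graph).
mean = {k: sum_rate[k] / active[k] for k in sum_rate if k in active}

# count > bool 0: with bool, every series is kept and its value becomes 1 or 0.
flag = {k: (1.0 if v > 0 else 0.0) for k, v in count_rate.items()}

# left or right: everything from the left side, plus the right-side series
# whose label sets are missing on the left (here: the idle series, with value 0).
result = {**flag, **mean}

print(mean)
print(result)
```

In this model the idle series is absent from `mean` but comes back in `result` with the value 0, which is why the line becomes continuous again.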
But the resulting query is slightly cumbersome, especially when you need more label filtering and some aggregation over the results. Something like this:
avg(
1000 * increase({__name__=~".*_hystrix_command_latency_total_seconds_sum", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s])
/
(increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > 0)
or
increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > bool 0
) by (command_group, command_name)
Long story short: are there any simpler ways to deal with zeros in the divisor? Or any common practices?