
There are times when you need to divide one metric by another metric.

For example, I'd like to calculate a mean latency like this:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])

If there is no activity during the specified time period, the rate() in the denominator becomes 0 and the result of the division becomes NaN. If I then do some aggregation over the results (avg() or sum() or whatever), the whole aggregation result becomes NaN.

So I add a check for zero in the denominator:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)

This removes the NaNs from the result vector, but it also tears the line on the graph to shreds.

Let's mark periods of inactivity with 0 value to make the graph continuous again:

rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
/
(rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > 0)
or
rate({__name__="hystrix_command_latency_total_seconds_count"}[60s]) > bool 0

This effectively replaces the NaNs with 0, the graph is continuous, and aggregations work OK.

But the resulting query is rather cumbersome, especially when you need to do more label filtering and some aggregation over the results. Something like this:

avg(
    1000 * increase({__name__=~".*_hystrix_command_latency_total_seconds_sum", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s])
    /
    (increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > 0)
    or
    increase({__name__=~".*_hystrix_command_latency_total_seconds_count", command_group=~"$commandGroup", command_name=~"$commandName", job=~"$service", instance=~"$instance"}[60s]) > bool 0
) by (command_group, command_name)

Long story short: are there any simpler ways to deal with zeros in the denominator? Or any common practices?

Yoory N.

5 Answers


If there is no activity during the specified time period, the rate() in the denominator becomes 0 and the result of the division becomes NaN.

This is the correct behaviour; NaN is what you want the result to be.

aggregations work OK.

You can't aggregate ratios. You need to aggregate the numerator and denominator separately and then divide.

So:

   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m]))
brian-brazil
  • Thanks! I've got your point. Actually, I was hoping that I had overlooked something like `ifNaN(, default_value_scalar)` that replaces all `NaN`s in the vector, or some language construct that acts this way, and that somebody would point me to it. Your variant looks very good, but it still needs to be written like `sum(...) / (sum(...) > 0) or sum(...) > bool 0` (spelled out after these comments) in case the sum in the denominator becomes 0 for a particular (command_group, command_name) label combination. – Yoory N. Nov 02 '17 at 06:53
  • You should not filter with `> 0` here, as it results in missing samples in the time series. NaN is the correct result, as that's what you get when you divide by 0. – brian-brazil Nov 02 '17 at 10:15
  • "it results in missing samples" -- that's why I added the `or sum(...) > bool 0` part to the query, to fill the gaps with zeros, representing some default value that's suitable for me in this case. – Yoory N. Nov 08 '17 at 06:17
  • The latency during those periods isn't 0 though, it's NaN. – brian-brazil Nov 08 '17 at 09:44
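
For reference, the pattern from the first comment, applied to this answer's query, would look something like the sketch below (just a sketch using the metric names from the question; whether filling the gaps with 0 is appropriate is exactly what the comments above debate):

   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_sum[5m]))
  /
   (sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m])) > 0)
  or
   sum by (command_group, command_name)(rate(hystrix_command_latency_total_seconds_count[5m])) > bool 0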

Finally I have a solution for my specific problem:

A division by zero leads to a NaN display. That is fine as a technical result and correct, but it is not what the user wants to see (it does not fulfil the business requirement).

So I searched a bit and found a "solution" for my problem in the Grafana community:

Surround your problematic value with max(YOUR_PROBLEMATIC_QUERY or vector(-1)). An additional value mapping then leads to a useful output.

(Of course you have to adapt the solution to your problem... min/max... vector(42)/vector(101)/vector(...))
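
Applied to the latency query from the question, that pattern would look something like this (a sketch; -1 is an arbitrary "no data" sentinel that you then turn into readable text with a Grafana value mapping):

max(
    rate(hystrix_command_latency_total_seconds_sum[60s])
    /
    rate(hystrix_command_latency_total_seconds_count[60s])
  or vector(-1)  # falls back to -1 when the ratio is NaN or the series is absent
)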

Update (1)

Okay. However, it seems to be a bit more tricky depending on the query. For example, I have another query that fails with NaN as a result of a division by zero, and the above solution does not work. I had to surround the query with brackets and append > 0 or on() vector(100).
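
The shape of that second workaround is roughly the following, where YOUR_PROBLEMATIC_QUERY is a placeholder and 100 is whatever default value makes sense for the panel (the on() makes the vector(100) fallback kick in only when the filtered query returns nothing):

(YOUR_PROBLEMATIC_QUERY) > 0 or on() vector(100)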

eventhorizon

Just add > smallest_value to the query before wrapping it into an aggregate function such as avg(), where smallest_value is a value smaller than any expected valid result of the inner query. For example:

avg((
  rate({__name__="hystrix_command_latency_total_seconds_sum"}[60s])
  /
  rate({__name__="hystrix_command_latency_total_seconds_count"}[60s])
) > -1e12)

Prometheus drops NaN values when comparing them to any number with the > operator, since any comparison involving NaN evaluates to false; for example, NaN > -1e12 is false, so the sample is filtered out. The same applies to the < operator, e.g. NaN < 1e12. So either > or < may be used for filtering out NaN values before aggregating them with aggregate functions.

P.S. This trick isn't needed in MetricsQL, since VictoriaMetrics automatically skips NaN values when aggregate functions are applied to them.

valyala

Based on @eventhorizon's answer: if the denominator of a query can sometimes return zero, it can mess up the graph and show infinity where there is no data. You can limit the results to a valid range.

For example, the output of this metric is supposed to be between 0 and 1, but it also produces infinity when there is no data:

(1/increase(SOMETIMES_ZERO_QUERY[1m]))

In this case you can write this instead, so that it shows 0 instead of values larger than 100:

max((1/increase(SOMETIMES_ZERO_QUERY[1m]))<100 or on() vector(0))

or, if you want infinity to be shown as 1:

max((1/increase(SOMETIMES_ZERO_QUERY[1m]))<100 or on() vector(1))

Bizhan

I faced the same issue and implemented the solution this way:

increase({metric query}[2m]) / (increase({problematic zero giving metric query}[2m]) != 0 or on() vector(1)) > 150

Here I had to check whether the denominator is 0, which would give infinity and make the graph absurd and nonsensical. To avoid this, I put the condition != 0 or on() vector(1) on the denominator, so that whenever it becomes 0 its value is replaced with 1.