
Because Prometheus topk returns more results than expected, and because https://github.com/prometheus/prometheus/issues/586 requires client-side processing that has not yet been made available via https://github.com/grafana/grafana/issues/7664, I'm trying to pursue a different near-term work-around to my similar problem.

In my particular case most of the metric values that I want to graph will be zero most of the time. Only when they are above zero are they interesting.

I can find ways to write Prometheus queries to filter data points based on the value of a label, but I haven't yet been able to find a way to tell Prometheus to return time series data points only if the value of the metric meets a certain condition. In my case, I want to filter for a value greater than zero.

Can I add a condition to a Prometheus query that filters data points based on the metric value? If so, where can I find an example of the syntax to do that?

Steve Dwire

3 Answers


If you're confused by brian's answer: The result of filtering with a comparison operator is not a boolean, but the filtered series. E.g.

min(flink_rocksdb_actual_delayed_write_rate > 0)

Will show the minimum value above 0.

In case you actually want a boolean (or rather 0 or 1), use something like

sum (flink_rocksdb_actual_delayed_write_rate >bool 0)

which will give you the number of series whose value is greater than zero.
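To make the difference concrete, here is a small sketch with the same metric; each expression is a separate query. The `count(...)` variant is added here only for comparison: it counts the same series, but returns an empty result rather than 0 when nothing matches.

```promql
# Filter: keeps only the series whose current value is above 0,
# with their original values and labels.
flink_rocksdb_actual_delayed_write_rate > 0

# bool modifier: keeps every series, but replaces each value
# with 1 (above 0) or 0 (not above 0).
flink_rocksdb_actual_delayed_write_rate > bool 0

# Number of series above 0; empty result when none match.
count(flink_rocksdb_actual_delayed_write_rate > 0)
```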

Caesar

Filtering is done with the comparison operators, for example x > 0.
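
As a rough sketch of how that fits into a full query, assuming the hypothetical `items_in_queue` metric and `host_name` label from the comments below (the `topk` limit of 5 is arbitrary); each expression is a separate query:

```promql
# Label filters go inside the braces; the value filter follows the selector.
items_in_queue{host_name="myhost"} > 0

# The filtered result can be passed to topk like any other instant vector,
# so queues that are currently empty never take up a slot.
topk(5, items_in_queue > 0)
```

When graphed over a time range, both the filter and `topk` are evaluated independently at each step, so a series only has points at the steps where its value was above 0.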

brian-brazil
  • Yes, and I can find ways to reference the value of a label in place of x in the example, but I have not been able to figure out how to reference the value of the metric itself. – Steve Dwire Oct 11 '17 at 21:32
  • For example, consider a metric named items_in_queue, with a label queue_name. I want to show a graph of how many items are in each queue, but if a queue had zero items in it for the entire duration of my graph, I don't want its name to show up in the legend. And if most of my queues are empty most of the time, I don't want a different collection of zero-depth queue names to show up in my results for every sample. What would a query look like to show me topk items_in_queue, but only when items_in_queue > 0? How do I reference the metric instead of a label? – Steve Dwire Oct 11 '17 at 21:33
  • I imagine it would look something like `items_in_queue{ _something_ > 0 }` but what do I put in place of _something_? – Steve Dwire Oct 11 '17 at 21:42
  • What do you mean "reference the metric instead of a label"? Labels are strings, you can't do math on them. – brian-brazil Oct 12 '17 at 00:13
  • Maybe I'm missing something even more basic. I've looked at both https://prometheus.io/docs/querying/operators and https://prometheus.io/docs/querying/examples/, but I can't seem to find any example that shows how/where those binary comparison operators fit into a query. Can you point me to an example of a complete query that shows a binary comparison operator in context? – Steve Dwire Oct 12 '17 at 11:12
  • https://www.robustperception.io/combining-alert-conditions/ Per above, `x > 0` is a complete PromQL expression. – brian-brazil Oct 12 '17 at 14:56
  • So, to elaborate a bit: if I wanted a Grafana graph that shows a time series of queue depth for all queues (items_in_queue metric) on a particular host (labeled by host_name), but only those queues whose items_in_queue metric was greater than zero, that query would look something like `items_in_queue{host_name=~"myhost"} > 0`. The filter on the label value goes inside the `{}`, and the filter based on the metric value goes after the `{...}`. Is that the way it works? – Steve Dwire Oct 12 '17 at 15:02

This can be solved with subqueries:

count_over_time((metric > 0)[5m:10s])

The query above would return the number of metric data points greater than 0 over the last 5 minutes.

This query may return inaccurate results depending on the relation between the second arg in square brackets (aka step for the inner query) and the real interval between raw samples (aka scrape_interval):

  • If the step exceeds scrape_interval, then some samples may be skipped during the calculation. In this case the query will return a lower result than expected.
  • If the step is smaller than scrape_interval, then some samples may be counted multiple times. In this case the query will return a higher result than expected.

So it is recommended to set the step equal to scrape_interval in order to get accurate results.
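
For example, a sketch assuming a scrape_interval of 15s (an assumed value; check the scrape config for the actual interval of your targets):

```promql
# Assumes the target is scraped every 15s, so the inner query is
# evaluated roughly once per raw sample over the last 5 minutes.
count_over_time((metric > 0)[5m:15s])
```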

P.S. The issues mentioned above are solved in VictoriaMetrics, the Prometheus-like monitoring system I work on. It provides the count_gt_over_time() function, which fits this case well. For example, the following MetricsQL query returns the exact number of raw samples with values greater than 0 over the last 5 minutes:

count_gt_over_time(metric[5m], 0)

valyala
  • For others: make sure to include two time ranges within the brackets, otherwise you'll get an error. Took me a bit to figure that out. – vahlala Mar 26 '20 at 15:39