[Prometheus][Grafana] Computing timespan for each system state

Question

I would like your help computing how long a system spends in each of its states.

I am using Prometheus v2.36.2 and Grafana v9.0.3. I have a Prometheus gauge that I've called: load_gauge. I can have four states with this metric:

load_gauge >= 10 => Overload State,
load_gauge <= 5 => Underload State,
load_gauge > 5 && load_gauge < 10 => Regular State,
If Prometheus is disconnected (no samples) => Off State.

I am trying to show on Grafana, for each given State, the duration of the state for the last 24 hours. A sample output would look like this: Overload for 1 hour, Underload for 6 hours, Regular for 2 hours, Off for 15 hours.

I played a lot with Grafana's metrics browser to build a query, something like:

count_over_time((load_gauge{job="prometheus"} > 10)[1d:])

but it does not seem to do the job. I also tried Grafana panels such as the Pie Chart, but that only shows each state as a percentage of the last 24 hours. Is it possible to get the duration directly, in hours or minutes? And where should the change be made: in Prometheus, by aggregating metrics, or in Grafana?

Thank you in advance for your response,

Josh Verdi

score 0 · Answer 1 · answered Jan 05 '23 at 12:31

I successfully used the following solution, written in Flux (InfluxDB's query language). Be careful though: it only works properly if the data is sampled densely enough to capture both the start and the end of each state:

import "array"

states_time_serie = from(bucket: "$bucket")
  |> range(start: v.timeRangeStart, stop:v.timeRangeStop)
  |> filter(fn: (r) => r.host == "$host")
  |> filter(fn: (r) => r._measurement == "mqtt_observatory_status")
  |> filter(fn: (r) => r._field == "state")

unique_states = states_time_serie
  |> unique()
  |> findColumn(fn: (key) => true, column: "_value")

// Define a helper function to extract a row as a record
getRow = (tables=<-, idx=0) => {
    extract = tables
        |> findRecord(fn: (key) => true, idx: idx)
    return extract
}

get_state_duration = (x) => {
  state_duration = states_time_serie
    // stateDuration() adds a "state_duration" column holding the cumulative time (in seconds here) spent in state x; rows where the predicate is false get -1.
    |> stateDuration(fn: (r) =>
      r._value == x,
      column: "state_duration",
      unit: 1s
      )
    // We need to sort backward in order to get positive value when doing difference on elapsed time to correspond on the current state
    //|> sort(columns: ["_time"], desc: true)
    // For each input table with n rows, difference() outputs a table with n - 1 rows.
    |> difference(columns: ["state_duration"], nonNegative: false)
    |> keep(columns: ["_time", "state_duration"])
    |> filter(fn: (r) => r["state_duration"] >= 0)
    |> sum(column: "state_duration")
    |> set(key: "state_name", value: x)
    |> getRow()
  return state_duration
}

// Compute the state duration for each state
states_durations = unique_states |> array.map(fn: get_state_duration)

// Output the array of records as a table
array.from(rows: states_durations)

In my case, the hardest part was automatically finding the list of states present in a given time series range. In your case, since you already know the definition of each state, it would most likely be easier to rely solely on stateDuration().
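
Since the question is about Prometheus rather than InfluxDB, a rough PromQL counterpart might look like the sketch below. It is untested against your data and rests on assumptions: the 1m subquery step is a value I picked (it should be no finer than your scrape interval), the thresholds are copied from the question, and each expression is meant to be a separate Grafana query. The idea is to count how many evaluation steps in the last 24 hours satisfy each condition, then multiply by the step length in seconds.

```promql
# Overload: seconds in the last 24h where load_gauge >= 10 (assumed 1m step, hence * 60)
count_over_time((load_gauge{job="prometheus"} >= 10)[24h:1m]) * 60

# Underload: seconds where load_gauge <= 5
count_over_time((load_gauge{job="prometheus"} <= 5)[24h:1m]) * 60

# Regular: seconds where 5 < load_gauge < 10
count_over_time(((load_gauge{job="prometheus"} > 5) and (load_gauge{job="prometheus"} < 10))[24h:1m]) * 60

# Off (approximation): steps out of roughly 1440 per day that returned no sample at all
(24 * 60 - count_over_time(load_gauge{job="prometheus"}[24h:1m])) * 60
```

Each result is only as precise as the step, and a state that never occurred returns no series rather than 0. The Off figure is the roughest of the four: Prometheus can keep returning the last sample for a few minutes after scrapes stop, and the query returns nothing if the series is missing for the whole window. Setting the Grafana field unit to seconds should let the panel display the values as minutes or hours.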

For the record, this is particularly useful for feeding a pie chart that accompanies a state timeline panel.
