Prometheus Get a summary of down time ranges

Question

I'm monitoring some services with blackbox_exporter and Prometheus. This works great for calculating service availability, but I'm wondering whether it is possible to get a summary of downtime ranges over the last x days with PromQL.

For example, if probe_success turns 0 between 1 PM and 1:30 PM and then again from 3 PM to 3:15 PM, I want to get a list like this one in Grafana:

Downtime:

1 PM - 1:30 PM | 30 mins
3 PM - 3:15 PM | 15 mins

and so on.

Michael Doubez · Answer 1 · 2021-07-29T11:16:31.840

What you are asking is difficult with PromQL. Prometheus is a time series database that stores sampled values, and you want to recover discrete events from those samples.

There is a way to recover the events where the 0/1 status of a metric changed.

First, use the changes() function with a detection range matching the poll interval of your metric to extract the change events (if the range does not fit the poll interval, you will see duplicated changes and may miss some events):
```
changes(metric[30s]) != 0
```
Then multiply by the actual metric value to tell whether the switch was to up (1) or down (0):
```
(changes(metric[30s]) != 0) * metric
```
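
To make the two steps concrete, here is a minimal Python sketch that mimics the expression on an in-memory list of samples. It only approximates PromQL: it assumes the range window always holds exactly the current and the previous sample, and the names `change_events` and `samples` are hypothetical, chosen for this illustration.

```python
def change_events(samples):
    """Return (timestamp, new_state) for each sample that differs from the
    previous one, roughly what (changes(metric[w]) != 0) * metric yields
    when the window w always covers exactly two samples."""
    events = []
    for (_, prev_value), (ts, value) in zip(samples, samples[1:]):
        changed = value != prev_value   # changes(metric[w]) != 0
        if changed:
            events.append((ts, value))  # "* metric" keeps the new state
    return events


# (timestamp in seconds, probe result): up, up, down, down, up, up
samples = [(0, 1), (30, 1), (60, 0), (90, 0), (120, 1), (150, 1)]
print(change_events(samples))  # [(60, 0), (120, 1)]
```

Only the two samples where the state flips survive the filter, and their value is the state after the flip.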

You can visualize the output using a subquery: ((changes(metric[30s]) != 0) * metric)[2d:]

0 @1627421720
1 @1627427120
0 @1627508120
1 @1627513520

The value gives you the new state, and the timestamp (after @) gives you the epoch time of the event (approximately, depending on the poll time).

We are not far from what you want; the remaining difficulty is transforming those samples into the consolidated table.
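
As a rough illustration of that transformation step outside Grafana, a small Python sketch could pair each down event with the next up event. The function names `downtime_rows` and `fmt` are hypothetical, the `events` list is simply copied from the sample output above, and rendering the times in UTC is an arbitrary choice.

```python
from datetime import datetime, timezone


def downtime_rows(events):
    """Pair each 0 (down) event with the next 1 (up) event.

    Returns a list of (start_epoch, end_epoch, duration_in_minutes)."""
    rows = []
    down_since = None
    for state, ts in events:
        if state == 0 and down_since is None:
            down_since = ts
        elif state == 1 and down_since is not None:
            rows.append((down_since, ts, (ts - down_since) // 60))
            down_since = None
    return rows


def fmt(ts):
    # Render an epoch timestamp as a UTC date and time
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")


# (state, epoch) pairs taken from the subquery output shown above
events = [(0, 1627421720), (1, 1627427120), (0, 1627508120), (1, 1627513520)]
for start, end, minutes in downtime_rows(events):
    print(f"{fmt(start)} - {fmt(end)} | {minutes} mins")
```

With the sample events, both outages come out as 90 minutes long. An outage that is still ongoing at the end of the list has no closing up event and is left out by this sketch.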

I am using Grafana v8.0.4 at the time of this answer and I don't see a way to integrate that into the current table visualization. My best advice would be to use an HTML panel and run your own JavaScript to display what you want.

Dustlay · Answer 2 · 2022-12-23T15:06:22.490

The answer from Michael Doubez didn't work for me, although my poll time is also set to 30s. When I execute this query:

probe_success{target="https://example.com"}[30s]

I always get only one result. But as we want to detect changes, we need two data points to compare. So I set the range to 60s, and my final query is this:

((changes(probe_success{target="https://example.com"}[60s]) != 0) * probe_success)[14d:30s]
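
The reasoning about needing two data points can be checked with a small Python sketch that counts how many scrapes fall inside a range window. This is a simplification, not Prometheus code: `samples_in_window` is a hypothetical helper, the 30s scrape schedule is assumed, and the evaluation time is deliberately not aligned with a scrape so that the exact window boundary handling does not matter.

```python
def samples_in_window(scrape_times, eval_time, window):
    """Return the scrape timestamps inside [eval_time - window, eval_time]."""
    return [t for t in scrape_times if eval_time - window <= t <= eval_time]


scrape_times = range(0, 300, 30)  # one scrape every 30s (assumed schedule)
eval_time = 100                   # arbitrary, falls between two scrapes

print(len(samples_in_window(scrape_times, eval_time, 30)))  # 1
print(len(samples_in_window(scrape_times, eval_time, 60)))  # 2
```

With a 30s window only the scrape at 90s is visible, so there is nothing to compare it with; a 60s window also sees the scrape at 60s, which is what changes() needs.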

The final query returns the state changes for the specified target over the last 14 days. The result should look something like this:

| Time                | instance                                                       | job      | target              | Value |
| ------------------- | -------------------------------------------------------------- | -------- | ------------------- | ----- |
| 2022-12-13 22:05:30 | prometheus-blackbox-exporter.prometheus-blackbox-exporter:9115 | blackbox | https://example.com | 0     |
| 2022-12-13 22:06:00 | prometheus-blackbox-exporter.prometheus-blackbox-exporter:9115 | blackbox | https://example.com | 1     |
| 2022-12-14 08:52:00 | prometheus-blackbox-exporter.prometheus-blackbox-exporter:9115 | blackbox | https://example.com | 0     |
| 2022-12-14 08:52:30 | prometheus-blackbox-exporter.prometheus-blackbox-exporter:9115 | blackbox | https://example.com | 1     |

Based on this you can generate your downtime ranges.

I created an example with Python 3:

#!/usr/bin/env python3

from datetime import datetime
import requests


def print_downtimes(probe_results):
    # Based on the following query and a 30s polling interval:
    # ((changes(probe_success[60s]) != 0) * probe_success)[14d:30s]
    # Each sample is a state change: 0 = went down, otherwise = came back up.
    all_downtimes = {}
    for host_result in probe_results:
        change_target = host_result['metric'].get('target')
        down_from = None  # start of the downtime currently being built
        for timestamp, change_value in host_result['values']:
            change_timestamp = int(timestamp)
            if int(change_value) == 0:
                # Keep the first "down" event if the change is reported twice
                if down_from is None:
                    down_from = change_timestamp
            elif down_from is not None:
                # "Up" events without a recorded "down" are skipped: that
                # downtime began before the query window
                all_downtimes.setdefault(change_target, []).append(
                    {"down_from": down_from, "down_to": change_timestamp})
                down_from = None
    for target, target_downtimes in all_downtimes.items():
        print(target)
        for downtime in target_downtimes:
            print(f"Down from {datetime.fromtimestamp(downtime['down_from'])} to {datetime.fromtimestamp(downtime['down_to'])}")


if __name__ == '__main__':
    PROMETHEUS_URL = 'http://localhost:9090/'
    response = requests.get(PROMETHEUS_URL + 'api/v1/query', params={
        'query': '((changes(probe_success[60s]) != 0) * probe_success)[14d:30s]',
    })
    data = response.json()

    print_downtimes(data['data']['result'])

Which prints something like this:

https://example.com
Down from 2022-12-14 09:01:30 to 2022-12-14 09:02:00
Down from 2022-12-14 09:21:30 to 2022-12-14 09:22:00
Down from 2022-12-14 18:40:00 to 2022-12-14 18:40:30
https://example2.com
Down from 2022-12-15 13:10:30 to 2022-12-15 13:11:30
