
We are about to set up Prometheus for monitoring and alerting for our cloud services, including a continuous integration and deployment pipeline for the Prometheus service and its configuration, such as alerting rules and thresholds. For that, I am thinking about four categories I want to write automated tests for:

  1. Basic syntax checks for configuration during deployment (we already do this with promtool and amtool)
  2. Tests for alert rules (what leads to alerts) during deployment
  3. Tests for alert routing (who gets alerted about what) during deployment
  4. Recurring check if the alerting system is working properly in production

The most important part to me right now is testing the alert rules (category 2), but I have found no tooling to do that. I could imagine setting up a Prometheus instance during deployment, feeding it some metric samples (though I am not sure how I would do that given Prometheus's pull architecture), and then running queries against it.

The only thing I have found so far is a blog post about monitoring the Prometheus Alertmanager chain as a whole, which relates to the fourth category.

Has anyone done something like that or is there anything I missed?

bsingr

1 Answer


Prometheus 2.5 and later let you write unit tests for alerting rules (here is a link). This covers points 1 and 2. You define the input data and the expected output, for example in test.yml:

rule_files:
    - alerts.yml
evaluation_interval: 1m
tests:
# Test 1.
- interval: 1m
  # Series data.
  input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
      - series: 'up{job="node_exporter", instance="localhost:9100"}'
        values: '1+0x6 0 0 0 0 0 0 0 0' # 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0

  # Unit test for alerting rules.
  alert_rule_test:
      # Unit test 1.
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
            # Alert 1.
            - exp_labels:
                  severity: page
                  instance: localhost:9090
                  job: prometheus
              exp_annotations:
                  summary: "Instance localhost:9090 down"
                  description: "localhost:9090 of job prometheus has been down for more than 5 minutes."

You can run tests using docker:

docker run \
-v $PROJECT/testing:/tmp \
--entrypoint "/bin/promtool" prom/prometheus:v2.5.0 \
test rules /tmp/test.yml

promtool checks whether the InstanceDown alert from alerts.yml is firing at the given eval_time with the expected labels and annotations. The advantage of this approach is that you don't have to start Prometheus.
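
The alerts.yml file that test.yml refers to is not shown above. A minimal sketch of what it might contain, assuming a single InstanceDown rule with a 5-minute `for` clause (the group name `example` is made up for illustration; the labels and annotation templates are chosen so that they render to the values listed under exp_alerts):

```yaml
# alerts.yml (assumed contents, inferred from the expectations in test.yml)
groups:
  - name: example          # hypothetical group name
    rules:
      - alert: InstanceDown
        # Fires for any target whose `up` metric is 0.
        expr: up == 0
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```

With a rule like this, the prometheus series is 0 from the start, so its alert should be firing by eval_time 10m. The node_exporter series only drops to 0 at the 7-minute mark, so at 10m its alert should still be pending, which is why only one alert appears under exp_alerts.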

streak
KrzyH
  • This is awesome! – bsingr Dec 14 '18 at 14:00
  • Are the `values` here actual values that the query would return from Prometheus? Or fake data, supplied only to create the conditions for the alert to fire? – jayarjo Aug 18 '22 at 09:15
  • @jayarjo yes, the `values` are simulated data used to test the alert. There are some good examples in the attached link. – Andy May 04 '23 at 13:14