I'm defining a data series for testing a Prometheus alert using the container_last_seen
metric from the cadvisor
exporter.
How do I enter timestamp series values, as returned by the container_last_seen metric? I'm testing Prometheus alerts on an Apple Mac which run in production on Linux boxes.
Here's one thing I tried:
input_series:
- series: |
container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
values: '1563968832+0x61'
It seems whatever I put in the values for the series is not accepted.
I've also tried durations: '0h+1mx60'
As this is legal: time() - container_last_seen{...}
cls is definitely a timestamp, and I would expect a timestamp to be represented by a Unix epoch number. Executing the query on Prometheus gives Unix epoch times, but putting numbers in a series is rejected with the error below.
promtool
is recognising the different types but giving much the same error:
➜ promtool test rules alertrules-service-oriented-test.yml
Unit Testing: alertrules-service-oriented-test.yml
FAILED:
1:1: parse error: unexpected number "0" in series values
If the values are '1h+0mx61'
, promtool
correctly identifies the values as durations:
1:1: parse error: unexpected duration "1h" in series values
Note that when this test is commented out, there is no 1:1: parse error
and the tests complete successfully. This is not a problem with out of sight parts of the test file.
Thanks for any insights.
Here's the alert:
alertrules.yaml:
- name: containers
interval: 15s
rules:
- alert: prod_container_crashing
expr: |
count by (instance, container_label_com_docker_swarm_service_name)
(
count_over_time(container_last_seen{container_label_com_docker_swarm_service_name!="",env="prod"}[15m])
) - 1 > 2
for: 5m
labels:
service: prod
type: container
severity: critical
annotations:
summary: "pdce {{ $labels.container_label_com_docker_swarm_service_name }}"
description: "{{ $labels.container_label_com_docker_swarm_service_name }} in prod cluster on {{ $labels.instance }} is crashing"
and here's the test file:
alertrules_test.yml:
rule_files:
- alertrules.yml
evaluation_interval: 1m
tests:
- name: container_tests
interval: 15s
input_series:
- series: |
container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
values: '1563968832+0x61'
alert_rule_test:
- eval_time: 15m
alertname: prod_container_crashing
exp_alerts:
- exp_labels:
service: prod
type: container
severity: critical
exp_annotations:
summary: prod service1
description: service1 in prod cluster on 10.0.0.1 is crashing