We use Prometheus with Alertmanager for alerting. We frequently lose metrics because of connection problems. When metrics go missing, Prometheus clears the alerts and Alertmanager sends resolved notifications. A few minutes later the connection problem is fixed and the same alerts fire again.
Is there any way to stop the resolved notifications when the metric data is simply missing?
For example: when a node is down, the other alerts for that node (CPU, disk usage checks) get resolved.
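The behavior I'm after looks roughly like an Alertmanager inhibit rule. A minimal sketch, assuming all per-node alerts carry the same instance label and the alert_group: host label (matcher values are placeholders for our setup, syntax is for Alertmanager 0.22+):

inhibit_rules:
  # While NodeDown is firing for an instance, suppress the other
  # node-level alerts (CPU, disk, ...) that share that instance.
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      - alert_group = host
      - alertname != NodeDown
    equal:
      - instance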
Relevant values from our Alertmanager and Prometheus configs:
repeat_interval: 1d
resolve_timeout: 15m
group_wait: 1m30s
group_interval: 5m
scrape_interval: 1m
scrape_timeout: 1m
evaluation_interval: 30s
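For clarity, these settings are split across the two configs, roughly like this (assuming the default file names alertmanager.yml and prometheus.yml):

# alertmanager.yml
global:
  resolve_timeout: 15m
route:
  group_wait: 1m30s
  group_interval: 5m
  repeat_interval: 1d

# prometheus.yml
global:
  scrape_interval: 1m
  scrape_timeout: 1m
  evaluation_interval: 30s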
NodeDown alert:
- alert: NodeDown
  expr: up == 0
  for: 30s
  labels:
    severity: critical
    alert_group: host
  annotations:
    summary: "Node is down: instance {{ $labels.instance }}"
    description: "Can't reach node_exporter at {{ $labels.instance }}. The host is probably down."
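For reference, the per-node alerts that get spuriously resolved look something like this (a hypothetical disk usage rule, not our exact expression); when the scrape fails, the underlying series disappear and the alert resolves even though nothing was fixed:

- alert: DiskUsageHigh
  # Fires when a filesystem is more than 85% full, based on node_exporter metrics.
  expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}) * 100 > 85
  for: 5m
  labels:
    severity: warning
    alert_group: host
  annotations:
    summary: "Disk usage above 85% on {{ $labels.instance }} ({{ $labels.mountpoint }})"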