
We use Prometheus Alertmanager for alerts. Frequently we are missing metrics because of connection problems.

So when metrics are missing, Prometheus clears the alerts and resolved notifications are sent. After a few minutes the connection problem is fixed and the alerts fire again.

Is there any way to stop the resolved notifications when metric data is missing?

For example: when a node is down, the other alerts for that node (CPU, disk usage checks) get resolved.

Values from my Alertmanager and Prometheus configs:

  repeat_interval: 1d
  resolve_timeout: 15m

  group_wait: 1m30s
  group_interval: 5m

  scrape_interval: 1m
  scrape_timeout: 1m 
  evaluation_interval: 30s

NodeDown alert:

  - alert: NodeDown
    expr: up == 0
    for: 30s
    labels:
      severity: critical
      alert_group: host
    annotations:
      summary: "Node is down: instance {{ $labels.instance }}"
      description: "Can't reach node_exporter at {{ $labels.instance }}. Probably the host is down."
        

2 Answers


Alertmanager can inhibit (that is, automatically silence) alerts under certain conditions. You will not see inhibited alerts either firing or resolving until the inhibiting condition is false again. Here is an example of one such rule:

inhibit_rules:
- # Mute alerts whose "severity" label equals "warning" ...
  target_matchers:
  - severity = warning

  # ... when an alert named "ExporterDown" is firing ...
  source_matchers:
  - alertname = ExporterDown

  # ... if both the muted and the firing alerts have exactly the same "job" and "instance" labels.
  equal: [instance, job]

To summarize, the above automatically silences all warning alerts for a certain machine when its metric source is down. The Alertmanager documentation on inhibition covers the subject in more detail.
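
Adapted to the NodeDown alert from the question, a rule along these lines should mute the other host-level alerts while NodeDown is firing. This is only a sketch: it assumes the CPU and disk alerts carry the same alert_group: host label and the same instance label as NodeDown.

inhibit_rules:
- # Mute the other host-level alerts (CPU, disk usage, ...) ...
  target_matchers:
  - alert_group = host
  - alertname != NodeDown

  # ... while a NodeDown alert is firing ...
  source_matchers:
  - alertname = NodeDown

  # ... for the same instance.
  equal: [instance]

The alertname != NodeDown matcher just makes it explicit that NodeDown itself is never muted by this rule.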

anemyte
  • Hi @anemyte, thank you so much. I applied this, and NodeDown alerts now override the resolved alerts. But sometimes, when the metric is gone for only a short interval, the NodeDown alert does not occur and only a resolved notification is sent. – Melike Sozeri Mar 10 '22 at 10:18
  • @MelikeSozeri Oh, there are a number of reasons why something like this could happen. For example, there can be something like a race condition, where one alert goes resolved before another. Check the `ALERTS` metric to see if they resolve simultaneously; if not, consider tuning `evaluation_interval` and `for` (in the alert definition). There is also [alert grouping](https://prometheus.io/docs/alerting/latest/alertmanager/#grouping), which has its own rules and intervals; check those too. Sorry for the short reply, it would take several pages to explain all this and it wouldn't fit into a comment. – anemyte Mar 10 '22 at 10:58
  • Thank you. I know these values and I have tested and updated them many times, but I'm stuck on the resolved alerts. When metrics are gone, I don't want to be notified of resolved alerts before the NodeDown alert. How should I change these values? Which one should be bigger than the other? I added my config values to my question. – Melike Sozeri Mar 10 '22 at 11:11

Did you consider using the last_over_time function? Like this:

last_over_time(up[2h]) == 0
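
As a sketch, this could slot into the NodeDown rule from the question as follows. Because last_over_time returns the most recent sample within the window, the expression keeps evaluating to 0 across gaps in the up series of up to two hours, so the alert does not resolve the moment samples stop arriving.

  - alert: NodeDown
    expr: last_over_time(up[2h]) == 0   # holds the last sample of `up` for up to 2h
    for: 30s
    labels:
      severity: critical
      alert_group: host
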
vb8448