I am using Prometheus Alertmanager to alert on some metrics. One of these metrics comes from a query that groups by a label, and the alert is defined on that query.

Example: this query, used on a Grafana dashboard, calculates the time since the last successful training of each model:

time() - max_over_time(max(spark_job_success_time{model=~"mymodel.*"})  by (model) [24h:1m])

This query produces a separate time series for each model whose name matches mymodel.*.

I want to set an alert (using Prometheus Alertmanager) on this metric that fires whenever a particular model (say model='mymodel.abc') crosses the alert threshold.

Right now, the expression is like this:

max(<the_above_query>) > 100

But this fires only once, when one model crosses the threshold; it does not fire again for other models that subsequently cross it. In other words, the alert fires at most once, no matter how many models exceed the threshold.
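To make the behaviour concrete, here is a minimal Python sketch of how an outer max() without a by clause collapses per-model series into a single label-less result. It is a simplified model of PromQL aggregation, not Prometheus code: the helper names (labels, max_all, greater_than) and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Dict, FrozenSet, Tuple

# An instant vector is modelled as a mapping from a label set to a sample value.
Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]


def labels(**kv: str) -> Labels:
    """Build a hashable label set, e.g. labels(model="mymodel.abc")."""
    return frozenset(kv.items())


def max_all(vector: Vector) -> Vector:
    """Model of max(<expr>) with no 'by' clause: all labels are dropped,
    so at most one series (with an empty label set) remains."""
    if not vector:
        return {}
    return {frozenset(): max(vector.values())}


def greater_than(vector: Vector, threshold: float) -> Vector:
    """Model of '<expr> > threshold': keep only series above the threshold."""
    return {ls: v for ls, v in vector.items() if v > threshold}


# Hypothetical "seconds since last successful training" per model.
per_model: Vector = {
    labels(model="mymodel.abc"): 250.0,
    labels(model="mymodel.def"): 130.0,
    labels(model="mymodel.xyz"): 40.0,
}

# Current alert expression: max(<query>) > 100
collapsed_alerts = greater_than(max_all(per_model), 100)

# Same comparison applied to the per-model series, without the outer max()
per_model_alerts = greater_than(per_model, 100)

print("with outer max():", [dict(ls) for ls in collapsed_alerts])
print("without outer max():", [dict(ls) for ls in per_model_alerts])
```

In this model the outer max() leaves one result with no model label, however many models exceed the threshold, while the per-model comparison keeps one result for each model above it.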

I want a separate alert for each model, so that the alert fires once for every model that crosses the threshold. How can I do this using templates in Alertmanager?

exAres
  • You mentioned it only triggers once, but if you go to `/alerts`, do you see the alerts? If so, it might be worth checking the inhibition rules in the Alertmanager config: https://prometheus.io/docs/alerting/latest/configuration/#inhibit_rule – Saf Nov 03 '20 at 08:10
