I am using Prometheus Alertmanager to alert on some metrics. One of the metrics is computed with a query that groups by a label,
and the alert is defined on that generic query.
Example: the following Grafana dashboard query calculates the time since the last successful training of each model:
time() - max_over_time(max(spark_job_success_time{model=~"mymodel.*"}) by (model) [24h:1m])
This query creates a separate time series for each model whose name matches mymodel.*.
I want to set an alert (using Prometheus Alertmanager) on this metric that is triggered whenever a particular model (say model='mymodel.abc') crosses the threshold set by the alert.
Right now, the expression is like this:
max(<the_above_query>) > 100
But this triggers only once, when one model crosses the threshold. The alert is not triggered again for other models that subsequently cross the threshold (i.e. it fires at most once, regardless of how many models cross the threshold set in the alert).
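To illustrate what I think is happening, here is a rough Python simulation of the two behaviours. This is only a sketch of my understanding, not Prometheus code: the per-model values and the function names are made up for illustration. As far as I understand, an outer max() without a by clause collapses all series into a single value with no model label, whereas comparing the per-model query directly against the threshold keeps one result per model.

```python
# Hypothetical per-model values of "seconds since last successful training".
# The numbers are invented purely for illustration.
seconds_since_success = {
    "mymodel.abc": 150.0,
    "mymodel.def": 40.0,
    "mymodel.xyz": 320.0,
}
THRESHOLD = 100


def alerts_with_outer_max(series, threshold):
    """Mimic `max(<query>) > threshold`.

    All series collapse into one value that carries no `model` label,
    so at most one result can ever come out.
    """
    if not series:
        return []
    top = max(series.values())
    return [({}, top)] if top > threshold else []


def alerts_per_series(series, threshold):
    """Mimic `<query> > threshold` without the outer max().

    Each series is compared on its own and keeps its `model` label.
    """
    return [
        ({"model": model}, value)
        for model, value in sorted(series.items())
        if value > threshold
    ]


collapsed = alerts_with_outer_max(seconds_since_success, THRESHOLD)
per_model = alerts_per_series(seconds_since_success, THRESHOLD)

print(len(collapsed), "result(s) with the outer max()")
print(len(per_model), "result(s) when comparing each series")
```

In this simulation the outer max() yields a single label-less result even though two models are over the threshold, which matches the single alert I am seeing.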
I want to create an alert for each model, so that the alert is triggered as many times as there are models that have crossed the threshold. How can I do this using templates in Alertmanager?