Situation: I have Prometheus and Alertmanager set up to monitor, among other things, the CPU temperature of various devices. Alertmanager sends alerts from production devices to PagerDuty.
The devices I'm monitoring have different models with different operating specs. Normal CPU temp for models 1-5 is 50C, while for model 6 it's 70C. Currently the threshold for the CPU temp alerts is 60C, so PagerDuty keeps getting alerts from model 6 devices that are operating at their normal temperature.
Is there a way to suppress CPU temp alerts from model 6 devices only, as long as the temp is below 80C, and still get CPU temp alerts for model 1-5 devices at 60C?
Note: There are lots of other metrics that are being monitored, but for all of them other than CPU temp, all device models have the exact same thresholds.
Here is the snippet from my alertmanager.yml that sends prod alerts to PagerDuty:

    - match:
        stack_name: prod
        severity: critical
      receiver: PagerDuty

Admittedly, I don't have much YAML experience. This is what I'm hoping to do, but I'm not sure of the correct syntax:

    - match:
        stack_name: prod
        severity: critical
        alertname: !device_cpu_temperature
      receiver: PagerDuty
    - match:
        stack_name: prod
        severity: critical
        alertname: device_cpu_temperature
        uuid: !*6X*
      receiver: PagerDuty
    - match:
        stack_name: prod
        severity: critical
        alertname: device_cpu_temperature
        uuid: *6X*
        value: >80
      receiver: PagerDuty

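From what I can tell from the Alertmanager documentation, versions 0.22 and later accept a `matchers` list with `!=` and `!~` operators, so the first two routes above might be written roughly as below. This is only a sketch under that assumption: the `default` receiver name is a placeholder, and I'm assuming `uuid` is a label on the alert. I still don't see a way to express the third route, since routes appear to match on labels only and not on the alert's value.

```yaml
route:
  receiver: default                # placeholder fallback receiver (assumption)
  routes:
    # All critical prod alerts except device_cpu_temperature
    - matchers:
        - 'stack_name = "prod"'
        - 'severity = "critical"'
        - 'alertname != "device_cpu_temperature"'
      receiver: PagerDuty
    # device_cpu_temperature alerts from devices whose uuid does not contain "6X"
    # (matcher regexes are anchored, hence the leading and trailing ".*")
    - matchers:
        - 'stack_name = "prod"'
        - 'severity = "critical"'
        - 'alertname = "device_cpu_temperature"'
        - 'uuid !~ ".*6X.*"'
      receiver: PagerDuty
```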
Desired outcome:
- All critical prod alerts except device_cpu_temperature are sent to PagerDuty
- Critical prod device_cpu_temperature alerts are only sent to PagerDuty if the model number isn't 6 (uuid contains the model number followed by an 'X')
- Critical prod device_cpu_temperature alerts from model 6 devices are sent to PagerDuty only if the cpu temp is above 80C.
Or would it be better to have two different alert rules in Prometheus, one per threshold? Can a rule be applied to only certain devices? If so, how?
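To make that second option concrete, this is roughly what I imagine the two-rule version would look like in a Prometheus rules file. It's a sketch, not something I've tested: the metric name `device_cpu_temperature_celsius`, the group name, and the `for` duration are placeholders I made up, and it assumes the `uuid` label is present on the temperature series so it can be selected with a regex.

```yaml
groups:
  - name: device_cpu_temperature          # placeholder group name
    rules:
      # Models 1-5: any device whose uuid does NOT contain "6X", threshold 60C
      - alert: device_cpu_temperature
        expr: device_cpu_temperature_celsius{uuid!~".*6X.*"} > 60
        for: 5m                           # placeholder duration
        labels:
          severity: critical
      # Model 6: devices whose uuid contains "6X", threshold 80C
      - alert: device_cpu_temperature
        expr: device_cpu_temperature_celsius{uuid=~".*6X.*"} > 80
        for: 5m                           # placeholder duration
        labels:
          severity: critical
```

If something like this works, both rules would keep the same alert name, so the existing Alertmanager route for critical prod alerts could presumably stay unchanged.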