I am new to Prometheus and alerting systems. I have developed a microservice and added metrics code that increments a counter whenever there is an error. Now I am trying to create an alert so that whenever the error count increases, it fires and sends an email, but I am unable to form a proper query for this scenario. I have tried something like error_total > 0, but that fires all the time, since the count stays above 0 once the first error has occurred unless we reset it manually.
1 Answer
What you are looking for is the increase function. The following expression triggers an alert whenever there was an error in the previous 15 minutes:
expr: increase(my_error_metric[15m]) > 0
annotations:
  summary: "Hey! There were {{ $value }} errors in the last 15 minutes"
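For context, the expr and annotations keys above belong to a single rule inside a rule group in a Prometheus rules file. A minimal sketch of such a file follows; the group name, alert name and severity label are hypothetical placeholders, not something from the original answer:

```yaml
groups:
  - name: microservice-errors        # hypothetical group name
    rules:
      - alert: ErrorsDetected        # hypothetical alert name
        # Fires while the counter has increased at any point in the last 15 minutes
        expr: increase(my_error_metric[15m]) > 0
        labels:
          severity: warning          # hypothetical label, useful for routing
        annotations:
          summary: "Hey! There were {{ $value }} errors in the last 15 minutes"
```

Because increase extrapolates over the range window, {{ $value }} is usually not an integer.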
Errors are common in microservices, and alerting on each of them is generally unmanageable. A more common strategy is to alert only when the error rate exceeds a given threshold (for example, 5%):
expr: irate(my_error_metric[2m]) / irate(number_of_call[2m]) * 100 > 5
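As a sketch of how the ratio could look as a full rule: the variant below swaps irate for rate (the Prometheus documentation recommends rate for alerts, since irate only looks at the last two samples) and adds a for clause so that a brief spike does not fire immediately. The group name, alert name, 5m window and for duration are assumptions chosen for illustration, and the division only returns results if both metrics carry the same label sets:

```yaml
groups:
  - name: microservice-error-rate    # hypothetical group name
    rules:
      - alert: HighErrorRate         # hypothetical alert name
        # Percentage of calls that ended in an error, averaged over 5 minutes
        expr: rate(my_error_metric[5m]) / rate(number_of_call[5m]) * 100 > 5
        # The condition must hold for 5 minutes before the alert fires
        for: 5m
        annotations:
          summary: "Error rate is {{ $value }}% (threshold: 5%)"
```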
Alerting on increase also means you can miss some errors: once the alert has fired for the first error, another error that occurs while you are investigating will not trigger a second alert. It is simply included in the alert that is already firing.

Michael Doubez
-
Hi @Michael Doubez, thank you for your response. The expression increase(my_error_metric[15m]) > 0 does not return anything for the first error, but once the second error occurs, the expression returns a value such as 1.66. What could be the reason for this behavior? – Ashish Gupta Mar 16 '20 at 08:46
-
Do you publish the metric when there is no error (with value 0)? That's the only thing that comes to mind. – Michael Doubez Mar 16 '20 at 14:16
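A possible explanation for the behaviour discussed above, offered as a sketch rather than a confirmed diagnosis: increase needs at least two samples in the range, so if the counter is first exposed with value 1 (never with 0), the first error yields no result, and later results are extrapolated to the window, which is why non-integer values such as 1.66 appear. Besides initialising the counter to 0 in the service, one workaround is to also match series that exist now but did not exist 15 minutes ago:

```yaml
# Sketch only: also fire when the counter series has just appeared.
# The left side catches increases between existing samples; the right side
# catches a series that is present now but had no sample 15 minutes ago.
expr: >
  increase(my_error_metric[15m]) > 0
  or
  (my_error_metric unless my_error_metric offset 15m) > 0
```

Note that in the second case {{ $value }} is the counter's current value, not an increase.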
-
Will this trigger separate alerts for each consecutive error within the 15 minutes? – somebody Sep 13 '21 at 09:26
-
I am not sure I understand the question. Alert expressions detect a state; Prometheus sends a signal to Alertmanager whenever the state changes, and again at regular intervals while it persists. If there is no error increase during 15 minutes, the alert is resolved; if there is an increase, the alert is in the "firing" state. If you are talking about notifications, those are handled at the Alertmanager level. – Michael Doubez Sep 13 '21 at 12:23
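To illustrate the last point, notification timing and email delivery are configured in Alertmanager, not in the alert rule. The following is a minimal sketch of an Alertmanager configuration; the SMTP host, addresses, receiver name and all interval values are hypothetical and would need to be adapted (a real SMTP server usually also needs authentication settings):

```yaml
global:
  smtp_smarthost: "smtp.example.com:587"   # hypothetical SMTP server
  smtp_from: "alertmanager@example.com"    # hypothetical sender address

route:
  receiver: team-email
  group_by: ["alertname"]
  group_wait: 30s       # wait before sending the first notification for a new group
  group_interval: 5m    # wait before notifying about new alerts added to a group
  repeat_interval: 4h   # re-send while the alert keeps firing

receivers:
  - name: team-email    # hypothetical receiver name
    email_configs:
      - to: "oncall@example.com"           # hypothetical recipient
```

With a setup like this, errors that keep the same alert firing do not each produce a new email; a reminder is sent only after repeat_interval.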