Background
Our code goes through a quality process:
- Unit tests
- End-to-end tests
- Code review
- Staging process
- Deployment process
In contrast, our alerts are written once and then occasionally modified by hand, with no quality process at all.
That is reasonable for simple threshold checks. However, some of our alerts are built on complicated queries, sometimes ~20 lines long.
If we accidentally break an alert, we are exposed to production instability, because we would no longer be notified when the logic or component it monitors fails.
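To make the gap concrete, here is a minimal sketch of the kind of regression check we have for code but not for alerts. It is plain Python, not SPL and not a Splunk API: `should_fire`, the row fields (`host`, `errors`, `total`), and the 5% threshold are all hypothetical stand-ins for one alert's trigger condition.

```python
from typing import Any, Dict, List


def should_fire(rows: List[Dict[str, Any]], max_error_rate: float = 0.05) -> List[str]:
    """Hypothetical stand-in for an alert condition.

    Returns the hosts whose error rate exceeds the threshold.
    """
    firing = []
    for row in rows:
        total = row["total"]
        if total == 0:
            continue  # no traffic on this host, nothing to evaluate
        if row["errors"] / total > max_error_rate:
            firing.append(row["host"])
    return firing


# Known-good fixture: the alert must stay silent.
healthy = [{"host": "web-1", "errors": 1, "total": 100}]
# Known-bad fixture: the alert must fire for web-2 only.
degraded = healthy + [{"host": "web-2", "errors": 10, "total": 100}]

assert should_fire(healthy) == []
assert should_fire(degraded) == ["web-2"]
```

The idea is that every change to an alert would be checked against a fixture that must trigger it and a fixture that must not, the same way a code change is checked by unit tests.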
The question
Is there a recommended methodology for validating the quality of complicated alerts?
P.S.
We're using Splunk alerts.