I have been having some difficulty understanding how to go about the ideal threshold for few of our cloudwatch alarms. I am looking at metrics for error rates, fault rate and failure rate. I am vaguely looking at having an evaluation period of around 15 mins. My metrics are being recorded at a minute level currently. I have the following ideas:
- To look at the avg of minute level data over a few days, and set it slightly higher than that.
- To try different thresholds (t1,t2 ..) and for a given day, see how many times the datapoints are crossing it in 15 min bins.
Not sure if this is the right way of going about it, do share if there is a better way of going about the problem.
PS 1: I know that thresholds should be based on Service Level Agreements(SLA), but let's say we do not have an SLA yet.
PS 2: Also does can I import data from cloudwatch to excel for some easier manipulation? Currently looking at running a few queries on log insights to calculate error rates.