Here are our requirements:
- Measure the average web page latency in close to real time (the pages are served by multiple instances on AWS ECS). We want our service to serve a page in, say, less than a second.
- Know when there's a problem: the number of responses with a status other than HTTP 200 should not spike.
- Check that separate services such as Elasticsearch are not down.
- We log some critical errors (such as a failed purchase) in Sentry or Elasticsearch, and we want to know when they spike.
- It would be nice to have a single monitoring UI and an alarm that fires when certain conditions are met.
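
To make the first two requirements concrete, this is roughly the kind of check I have in mind. It is only a sketch under my own assumptions: the 60-second window, the 1-second latency limit, the 5% error limit, and the names `Request` and `RollingWindow` are all made up for illustration and are not taken from any existing tool.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    timestamp: float  # seconds, e.g. time.time() when the response was sent
    latency_s: float  # how long the page took to serve, in seconds
    status: int       # HTTP status code of the response


class RollingWindow:
    """Keeps only the requests seen in the last `window_s` seconds (hypothetical helper)."""

    def __init__(self, window_s: float = 60.0) -> None:
        self.window_s = window_s
        self.requests = deque()

    def add(self, req: Request) -> None:
        self.requests.append(req)
        # Drop everything older than the window, measured from the newest request.
        cutoff = req.timestamp - self.window_s
        while self.requests and self.requests[0].timestamp < cutoff:
            self.requests.popleft()

    def average_latency(self) -> float:
        if not self.requests:
            return 0.0
        return sum(r.latency_s for r in self.requests) / len(self.requests)

    def error_rate(self) -> float:
        # Fraction of responses in the window whose status is not 200.
        if not self.requests:
            return 0.0
        errors = sum(1 for r in self.requests if r.status != 200)
        return errors / len(self.requests)

    def alarms(self, max_latency_s: float = 1.0, max_error_rate: float = 0.05) -> List[str]:
        # Both thresholds are assumed values, not recommendations.
        triggered = []
        if self.average_latency() > max_latency_s:
            triggered.append("average latency above limit")
        if self.error_rate() > max_error_rate:
            triggered.append("non-200 rate above limit")
        return triggered
```

Each instance would feed its requests into something like this (or, more realistically, ship them to a central place that does the same aggregation), and the alarm would fire whenever `alarms()` returns a non-empty list.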
I don't know whether we need to build a service ourselves; I'm hoping we can use a ready-made one.
Where should we collect the data?
I've been looking at:
- Elasticsearch, Kibana (lacking alarms)
- StatsD (seems like we would need a separate front end for visualization)
- Netdata (looks more like a system monitoring tool than a data aggregation tool)
- Munin, Nagios (not sure if these are what we need)
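
For the StatsD option, this is a rough sketch of how I imagine the application would report its numbers. It assumes StatsD's plain-text UDP format (`name:value|ms` for timers, `name:value|c` for counters) and a daemon listening on the default port 8125 on localhost. The metric names and the helper functions are hypothetical.

```python
import socket
from typing import Iterable


def timing_line(name: str, millis: float) -> bytes:
    # StatsD timer: "<name>:<value>|ms", value rounded to whole milliseconds here.
    return f"{name}:{round(millis)}|ms".encode("ascii")


def counter_line(name: str, count: int = 1) -> bytes:
    # StatsD counter: "<name>:<value>|c"
    return f"{name}:{count}|c".encode("ascii")


def send(lines: Iterable[bytes], host: str = "127.0.0.1", port: int = 8125) -> None:
    # Fire-and-forget UDP: one datagram per metric, no reply is expected.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line, (host, port))


# Example: one page served in 320 ms that ended with a 500 response.
# "web.page.latency" and "web.status.500" are made-up metric names.
metrics = [
    timing_line("web.page.latency", 320.0),
    counter_line("web.status.500"),
]
# send(metrics)  # uncomment once a StatsD daemon is actually listening
```

This only covers getting the data out of the application; the visualization and alarm parts would still have to come from whatever sits behind StatsD.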