Here are our requirements.

  1. Measure the near-real-time average latency of our web pages (hosted on multiple AWS ECS instances). We want our service to serve a page in, say, less than a second.

  2. If HTTP statuses other than 200 spike, we want to know that there's a problem.

  3. Check that separate services such as Elasticsearch are not down.

  4. We log some critical errors (such as a failed purchase) in Sentry or Elasticsearch and want to know when they spike.

  5. It would be nice to have a single monitoring UI and an alarm when certain conditions are met.

I don't know whether we need to build a service ourselves; I'm hoping we can use a ready-made one.

Where should we collect the data?
I've been looking at:

  • Elasticsearch, Kibana (lacks alarms)
  • StatsD (seems to need a separate front end for visualization)
  • Netdata (looks more like a system monitoring tool than a data aggregation tool)
  • Munin, Nagios (not sure whether these are what we need)
eugene

2 Answers

It seems like DataDog could be a good solution for you. You can use it to monitor Elasticsearch, and it has an APM product you can integrate into your app to monitor its performance. If you monitor your app with Honeybadger, you can send metrics on those errors to DataDog, too.
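
As a rough illustration of what "sending metrics" can look like, here is a minimal sketch that formats StatsD-style datagrams for page latency and request counts. It assumes an agent listening for StatsD over UDP on 127.0.0.1:8125 and uses the DogStatsD `|#key:value` tag extension; the metric names `page.latency` and `page.requests` and the helper functions are hypothetical, and in practice an official client library would normally do this for you.

```python
import socket


def format_metric(name, value, kind, tags=None):
    """Build one StatsD datagram; `tags` uses the DogStatsD `|#k:v` extension."""
    line = f"{name}:{value}|{kind}"
    if tags:
        line += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumed agent defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


def record_request(status, elapsed_ms, service="web"):
    """Return the datagrams for one served request (hypothetical metric names)."""
    tags = {"service": service, "status": status}
    return [
        format_metric("page.latency", round(elapsed_ms), "ms", tags),  # timer
        format_metric("page.requests", 1, "c", tags),                  # counter
    ]
```

Each web instance would call `send_metric` for every line returned by `record_request`; tagging by status is what lets a dashboard or alert separate 200s from everything else.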

Benjamin Curtis

Zabbix can handle most of these well.

  • with a "web scenario", web page latency can be measured via items such as "web.test.time" (response time) and "web.test.in" (download speed)
  • "web.test.rspcode" also reports the HTTP response code, so you can trigger an alert where needed (for example: 200, 400, 401, 404, 500, 503...)
  • you can easily monitor Elasticsearch using the official template, with some extra zabbix_agentd settings
  • the logged critical errors might require additional dedicated services (Sentry, an EFK stack, ElastAlert) to achieve the goal
  • Zabbix has its own centralized UI (with proxy support) and many kinds of alerts (shown on the dashboard, email, Slack, SMS, PagerDuty...)
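
To make the first two points concrete, here is a small sketch of the kind of rule such a trigger expresses: keep a rolling window of recent page checks and flag when the average response time or the share of non-200 responses crosses a threshold. This is not Zabbix's implementation; the class name, window size, and thresholds are all assumptions chosen for illustration.

```python
from collections import deque


class PageCheckWindow:
    """Rolling window of (status_code, seconds) samples from page checks."""

    def __init__(self, size=20, max_avg_seconds=1.0, max_error_rate=0.05):
        self.samples = deque(maxlen=size)  # oldest sample drops automatically
        self.max_avg_seconds = max_avg_seconds
        self.max_error_rate = max_error_rate

    def add(self, status_code, seconds):
        self.samples.append((status_code, seconds))

    def average_latency(self):
        if not self.samples:
            return 0.0
        return sum(seconds for _, seconds in self.samples) / len(self.samples)

    def error_rate(self):
        if not self.samples:
            return 0.0
        errors = sum(1 for status, _ in self.samples if status != 200)
        return errors / len(self.samples)

    def alerts(self):
        """Return the names of the conditions that are currently violated."""
        raised = []
        if self.average_latency() > self.max_avg_seconds:
            raised.append("latency")
        if self.error_rate() > self.max_error_rate:
            raised.append("error_rate")
        return raised
```

Feeding it one sample per check (for example the response code and response time of each scenario step) and calling `alerts()` after each one gives the "alarm when certain conditions are met" behaviour from the question.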
Cuong Nguyen