
I have a set of services. Every service contains some components.

Some of them are stateless, some of them are stateful, some are synchronous, some are asynchronous.

I have used different approaches to monitoring and alerting:

Log-based alerting and metrics gathering, New Relic, and our own home-grown solutions.

Basically, at the moment I am looking for a way to generalize and aggregate the important metrics for all services in a single place. One thing I want is to monitor products rather than separate services.

As an end result, I picture a single dashboard with a small number of widgets; by looking at those widgets I would be able to say for sure whether the services are usable to the end customer.

Perhaps someone can recommend an approach or methodology, or point me to some best practices.

Jevgeni Smirnov

2 Answers


I like what you're trying to achieve! A service is not production-ready unless it's thoroughly monitored.

I believe what you're describing falls into the topics of health checking and metrics.

... I would be able to say for sure whether the services are usable to the end customer.

That, however, will require a little of both ;-) To ensure you're currently fulfilling your SLA, you have to make sure that your services are a) running and b) performing as requested. For both problems I suggest looking at the StatsD toolchain. Initially developed by Etsy, it has become a de-facto standard for gathering metrics.

To ensure all your services are running, we rely on Kubernetes. It takes our description of what should run, what should be reachable from outside, etc. and hosts that on our infrastructure. It also makes sure that, should things die, they get restarted, and it helps with things like auto-scaling as well. Awesome tooling, and kudos to Google! The way it ensures this is with health checks: there are multiple ways to verify that a service node booted by Kubernetes is alive and kicking (namely HTTP calls and CLI scripts, though this is modular should you need anything else). If Kubernetes detects unhealthy nodes, it will immediately phase them out and start another node instead.
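To make the HTTP-call variant concrete, here is a minimal sketch (my illustration, not part of the original setup) of a health endpoint that a Kubernetes liveness or readiness probe could poll. The /healthz path, the port, and the dependency check are assumptions made for the example.

```python
# Sketch: a tiny HTTP health endpoint for a Kubernetes probe to call.
# The /healthz path and port 8080 are arbitrary choices for this example.
from http.server import BaseHTTPRequestHandler, HTTPServer

def dependencies_ok() -> bool:
    # Hypothetical check: verify databases, queues, etc. are reachable.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz" and dependencies_ok():
            self.send_response(200)   # probe succeeds: node is considered healthy
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(503)   # probe fails: Kubernetes phases the node out
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```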

Now, to make sure all your services perform as expected, you'll need to gather some metrics. For all of our services (and all individual endpoints), we gather a few metrics via StatsD, such as the following (see the sketch after this list):

  • Requests/sec
  • Number of errors returned (404, etc.)
  • Response times (average, median, and percentiles, depending on the service's SLA)
  • Payload size (average)
  • Sometimes the number of concurrent requests per endpoint and the number of instances currently running
  • General metrics like the host's current CPU and memory usage and uptime
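As an illustration of how such metrics get emitted (my sketch, not the answer's code), instrumenting a request handler with a StatsD client might look roughly like this. It assumes the `statsd` Python client package, a StatsD daemon on localhost:8125, and made-up metric names.

```python
# Sketch: emitting the metrics listed above through a StatsD client.
# Assumes `pip install statsd` and a StatsD daemon listening on localhost:8125.
import time
from statsd import StatsClient

statsd = StatsClient(host="localhost", port=8125, prefix="orders-service")

def handle_request(request_body: bytes) -> int:
    start = time.time()
    statsd.incr("search.requests")        # requests/sec (counter)
    try:
        status = 200                      # ... the actual request handling would go here ...
        return status
    except Exception:
        statsd.incr("search.errors")      # number of errors returned
        return 500
    finally:
        elapsed_ms = (time.time() - start) * 1000
        statsd.timing("search.response_time", elapsed_ms)       # backend derives average/median/percentiles
        statsd.gauge("search.payload_size", len(request_body))  # payload size (last value; use a timer for averages)
```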

We gather a lot more metrics, but that's about the bottom line. Since StatsD has become more of a protocol specification than a concrete product, there are a myriad of collectors, front-ends and back-ends to choose from. They help you visualize your system's state, and many of them feature alerts if some metric, or some combination of metrics, goes beyond its threshold.
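Because the wire format is just plain text over UDP (a metric line is simply `<bucket>:<value>|<type>`), it is easy to see why so many implementations exist. A rough sketch without any client library, with bucket names again invented for the example:

```python
# Sketch: raw StatsD line protocol over UDP ("<bucket>:<value>|<type>").
# localhost:8125 is the conventional default address of the StatsD daemon.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for line in (
    b"orders-service.search.requests:1|c",         # counter
    b"orders-service.search.response_time:42|ms",  # timer, in milliseconds
    b"orders-service.search.payload_size:1024|g",  # gauge
):
    sock.sendto(line, ("localhost", 8125))
```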

Let me know if this was helpful!

enzian

There are at least three types of things you will need to monitor: the host where the service is deployed, the component itself, and the SLAs. Some of these depend on the software stack you're using as well as on the architecture.

With that said, you could, for example, use Nagios to monitor the hardware where the services are deployed, and Splunk for the services' metrics/SLAs as well as for any errors that might occur. You can also use SNMP traps for when something goes wrong, if you have a more sophisticated support structure; these would be your triggers. Without knowing how your infrastructure/services are set up, it is hard to go into deeper detail.
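To give one concrete illustration of the Nagios side (my sketch, not part of this answer), a custom check is just an executable that prints a status line (optionally with `| perfdata`) and exits with 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The disk-usage check and the 80%/90% thresholds below are invented for the example.

```python
#!/usr/bin/env python3
# Sketch of a Nagios-style check plugin.
# Exit codes follow the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
import shutil
import sys

WARN, CRIT = 80.0, 90.0  # example thresholds, percent of disk used

def main() -> int:
    try:
        usage = shutil.disk_usage("/")
        used_pct = usage.used / usage.total * 100
    except OSError as exc:
        print(f"UNKNOWN - could not read disk usage: {exc}")
        return 3
    msg = f"disk usage {used_pct:.1f}% | used_pct={used_pct:.1f}%;{WARN};{CRIT}"
    if used_pct >= CRIT:
        print(f"CRITICAL - {msg}")
        return 2
    if used_pct >= WARN:
        print(f"WARNING - {msg}")
        return 1
    print(f"OK - {msg}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```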

MeTitus