
In my project:

  1. We crawl x servers.
  2. The number of users on each server varies from 1 to n.
  3. We crawl 1 to z items for each user.

Currently we monitor QoS using Graphite. We store the time taken to crawl each item:

x.time_taken

The problem with this approach is that if only a single user is affected, we get a false QoS alert.

Which tool or technique would be right for the following requirements?

  1. Alert only if at least k users are affected (a count of users, not of events).
  2. List the users who were affected.

I don't think Graphite and statsd are the right tools for this. What would be a better tool for answering these two questions?

Vivek Goel

1 Answer


What you are asking for is often called Service Monitoring. For very good reasons you want to know the service impact of an event, rather than just that an event has happened.

The advantage of this approach is exactly what you state in your requirements: you can focus on events that impact a large part of your user base, and you have a list of the affected users right away.

The main drawback, IMHO, is that Service Monitoring is usually much more complex than simple performance or event/alert monitoring. It also often relies on a service model, which in my experience is something that is hard to build and even harder to keep up to date.

For example, if a server in your system shows a significant slowdown or failure, the impact depends on your architecture: it may affect all users of a service that relies on that server, a very small subset, or initially none at all if a load balancing or redundancy mechanism is in place.

You would need to reflect this architecture in your service monitoring model, and also change it every time you update your system architecture or deployment.

If your system is static enough or critical enough to warrant the investment, this may be worth your while. If not, a simple compromise is to change your existing graphing and alerting so that it fires when the average response time over a set number of users, or over all users on a server, increases by a significant amount.

This may give you most of the benefits you are after without having to invest in the extra complexity of a service monitoring solution.
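As a rough illustration of that compromise, here is a minimal Python sketch that groups crawl timings per user and only flags a server when at least k distinct users are slow. It is not a Graphite or statsd feature. The function names, the `(server, user, seconds)` sample format, and the fixed threshold are all assumptions made for the example, so you would need to adapt them to however you actually collect timings.

```python
from collections import defaultdict
from statistics import mean


def affected_users(samples, threshold_s):
    """Return {server: sorted users whose mean crawl time exceeds threshold_s}.

    `samples` is an iterable of (server, user, seconds) tuples. This input
    shape is hypothetical and chosen only for the example.
    """
    per_user = defaultdict(list)
    for server, user, seconds in samples:
        per_user[(server, user)].append(seconds)

    slow = defaultdict(list)
    for (server, user), times in per_user.items():
        # Average per user, so one user with many slow items counts once.
        if mean(times) > threshold_s:
            slow[server].append(user)
    return {server: sorted(users) for server, users in slow.items()}


def servers_to_alert(samples, threshold_s, min_users):
    """Keep only servers where at least `min_users` distinct users are slow."""
    return {
        server: users
        for server, users in affected_users(samples, threshold_s).items()
        if len(users) >= min_users
    }


# Hypothetical timings collected over one monitoring window.
samples = [
    ("s1", "a", 5.0), ("s1", "a", 7.0),
    ("s1", "b", 6.5), ("s1", "c", 0.5),
    ("s2", "d", 9.0),
]
alerts = servers_to_alert(samples, threshold_s=2.0, min_users=2)
```

Here `alerts` maps each server that crosses the user-count bar to the list of slow users, which covers both the "at least k users" condition and the list of who was affected. A server with a single slow user, such as `s2` above, is left out.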

If you definitely want to expand your monitoring approach and stick with open source tools, I would start by looking at Nagios if your focus is on infrastructure. Alternatively, there are quite a few web service monitoring solutions with free tiers, such as Pingdom.

Mick