
I've been tasked with setting up monitoring for 300 servers that do different things. I've been looking at various tools, such as Nagios and Munin, so I have a pretty good idea of how I can achieve monitoring in the first place.

What I'm wondering is: which metrics are a good default to monitor when I don't know much about a server? And what are "sane defaults" as far as alerting goes?

My plan is to deploy a monitoring scheme with sane defaults as a start, while I map out the roles of the different systems - which I expect will take some time.

This question can also be asked in a different way:

If you were designing a monitoring appliance, what should its default Linux monitoring template contain?

Kvisle

4 Answers


The usual metrics that indicate problems include CPU utilization, memory utilization, load average, and disk utilization. For mail servers, the size of the mail queue is an important indicator. For web servers, the number of busy server processes is an important measure. Excessive network throughput also leads to problems. If you have processes that depend on accurate timestamps, the NTP offset is worth monitoring to keep clocks in sync.

Standard warning levels I have used are listed below as (warning, critical) pairs. You may want to adjust these values based on a number of factors: higher values reduce the number of alerts, while lower values give you more time to react to developing problems. This should be a suitable starting point for a template; a minimal check sketch follows the list.

  • Sustained CPU utilization (80%, 100%). Exclude time for niced processes.
  • Load average per CPU (2, 5).
  • Disk utilization per partition (80%, 90%).
  • Mail queue (10, 50). Use lower values on non-mail servers.
  • Busy web servers (10, 25).
  • Network throughput (80%, 100%). Network backups and similar processes may exceed these values; I would use throttling settings if they are available.
  • NTP offset in seconds (0.2, 1).
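
For illustration, here is a minimal sketch of what one such check could look like as a script, following the standard Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL). The per-CPU thresholds are the (2, 5) pair from the list; everything else is an assumption, not an existing plugin:

    #!/usr/bin/env python3
    """Minimal sketch: check 1-minute load average per CPU, Nagios-plugin style."""
    import os
    import sys

    WARNING, CRITICAL = 2.0, 5.0  # per-CPU thresholds from the list above

    def main() -> int:
        load_1m = os.getloadavg()[0]           # 1-minute load average (Unix only)
        per_cpu = load_1m / (os.cpu_count() or 1)
        if per_cpu >= CRITICAL:
            print(f"CRITICAL - load per CPU is {per_cpu:.2f}")
            return 2
        if per_cpu >= WARNING:
            print(f"WARNING - load per CPU is {per_cpu:.2f}")
            return 1
        print(f"OK - load per CPU is {per_cpu:.2f}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())

The same shape - measure, compare against the (warning, critical) pair, exit 0/1/2 - applies to the other items in the list.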

Munin does a good job of gathering these statistics and others, and it can trigger alarms when thresholds are passed, although its alerting capabilities are not as good as those of Nagios. Its gathering and display of historical data make it a good choice for reviewing whether current values differ significantly from past values. It is easy to set up and can be run without generating warnings. Its main drawbacks are the volume of data captured and its fixed gathering frequency; you may want to generate graphs on demand. Munin provides many of the statistics I would check with sar when a system was in trouble, and its overview page is useful for identifying possible problems.

Nagios is very good at alerting, but historically it has not been good at gathering historical data in a form suitable for comparison with current values. This appears to be changing, and the new release is much better at gathering such data. It is a good choice for generating warnings when there are problems and for scheduling downtime during which alerts are not generated. Nagios is very good at alerting when services go down, which makes it especially suitable for critical servers and services.

BillThor
  • Cacti has an easier web interface and has templates easily available; it would be a good choice too. – Gaumire Nov 06 '11 at 05:15
  • This is the kind of answer I was hoping for! :) – Kvisle Nov 06 '11 at 11:44
  • @BillThor What does "Busy web servers (10, 25)" mean? – kkurian Jun 19 '17 at 12:59
  • @kkurian Warns if 10 web servers are processing requests, and alarms when 25 web servers are processing requests. Web servers usually respond very quickly so should not be processing a large number of concurrent requests. – BillThor Jun 19 '17 at 21:02
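
For illustration, a busy-worker check like the one described in the comment above could read Apache's machine-readable mod_status output. This sketch assumes mod_status is enabled; the status URL is a hypothetical local endpoint:

    #!/usr/bin/env python3
    """Sketch: count busy Apache workers via mod_status (assumes mod_status is enabled)."""
    import sys
    from urllib.request import urlopen

    STATUS_URL = "http://localhost/server-status?auto"  # hypothetical endpoint
    WARNING, CRITICAL = 10, 25  # thresholds from the answer above

    def busy_workers(url: str) -> int:
        # The ?auto format emits machine-readable lines such as "BusyWorkers: 5"
        with urlopen(url, timeout=5) as resp:
            for line in resp.read().decode().splitlines():
                if line.startswith("BusyWorkers:"):
                    return int(line.split(":", 1)[1])
        raise RuntimeError("BusyWorkers not found in status output")

    if __name__ == "__main__":
        busy = busy_workers(STATUS_URL)
        if busy >= CRITICAL:
            print(f"CRITICAL - {busy} busy workers")
            sys.exit(2)
        if busy >= WARNING:
            print(f"WARNING - {busy} busy workers")
            sys.exit(1)
        print(f"OK - {busy} busy workers")
        sys.exit(0)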

I would use Nagios if I were you, for a number of reasons (here are two of them):

  1. You can use "templates" and set up server groups, monitoring different groups with different metrics. For example, put all of your web servers in one group, all of your database servers in another, and so on.
  2. It's very easy to automate alerts to go out by email and other channels, and to create an alert escalation in case the first on-call responder doesn't acknowledge the alert within a certain amount of time.

A third reason is that Nagios already comes with a default monitoring configuration that covers most of the things you'd want to monitor across the board, so you wouldn't have to define your own monitoring "metrics" to begin with.

But if I were setting up my own metrics, I would monitor things like server load, free disk space, free memory, and swap usage across all servers, and I would also do some external monitoring with ICMP pings; a sketch of such a ping check follows.
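
A minimal sketch of such an external ICMP check, shelling out to the system ping command (the target host is a placeholder, and the -W timeout flag assumes Linux ping):

    #!/usr/bin/env python3
    """Sketch: external reachability check using the system ping command."""
    import subprocess
    import sys

    HOST = "example.com"  # placeholder target

    def host_alive(host: str, count: int = 3, timeout_s: int = 5) -> bool:
        # ping exits 0 when at least one reply was received
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    if __name__ == "__main__":
        if host_alive(HOST):
            print(f"OK - {HOST} responds to ICMP")
            sys.exit(0)
        print(f"CRITICAL - {HOST} unreachable")
        sys.exit(2)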

David W

You can first monitor system resources such as CPU and memory.

Then, you can monitor service-specific resources: for example, the response time and the number of active connections. A sketch of a simple response-time check follows.
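
A minimal sketch of such a response-time check, assuming an HTTP service; the URL and thresholds are placeholders to tune per service:

    #!/usr/bin/env python3
    """Sketch: measure HTTP response time for a service check."""
    import sys
    import time
    from urllib.request import urlopen

    URL = "http://localhost/"          # placeholder service endpoint
    WARNING_S, CRITICAL_S = 1.0, 5.0   # example thresholds

    if __name__ == "__main__":
        start = time.monotonic()
        try:
            with urlopen(URL, timeout=CRITICAL_S) as resp:
                resp.read(1)  # first byte is enough to measure responsiveness
        except Exception as exc:
            print(f"CRITICAL - request failed: {exc}")
            sys.exit(2)
        elapsed = time.monotonic() - start
        if elapsed >= CRITICAL_S:
            print(f"CRITICAL - response took {elapsed:.2f}s")
            sys.exit(2)
        if elapsed >= WARNING_S:
            print(f"WARNING - response took {elapsed:.2f}s")
            sys.exit(1)
        print(f"OK - response took {elapsed:.2f}s")
        sys.exit(0)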

For the default alerting values, I think they should be related to the expected usage pattern and how busy you expect the server to be.

Khaled

In general, to start with I would monitor server load, CPU usage, memory, disk space, disk I/O, and network traffic. Then, depending on the type of server (web/mail/database/NIS), I would monitor application-specific stats and other vitals such as interface errors, latency, and response time; a sketch for reading interface error counters follows.
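
As a sketch of the interface-error part: on Linux, per-interface error counters can be read from /proc/net/dev (the thresholds you'd alert on are left out here, since they depend on the environment):

    #!/usr/bin/env python3
    """Sketch: read per-interface error counters from /proc/net/dev (Linux)."""

    def interface_errors(path: str = "/proc/net/dev") -> dict:
        errors = {}
        with open(path) as f:
            for line in f.readlines()[2:]:  # first two lines are headers
                iface, counters = line.split(":", 1)
                fields = counters.split()
                # field 2 is receive errors, field 10 is transmit errors
                errors[iface.strip()] = (int(fields[2]), int(fields[10]))
        return errors

    if __name__ == "__main__":
        for iface, (rx_err, tx_err) in interface_errors().items():
            print(f"{iface}: rx_errors={rx_err} tx_errors={tx_err}")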

Gaumire