
I've looked at a lot of monitoring tools, and most of them show the same things. But I'm wondering whether it's really necessary to watch all of these things.

I would like to know which metrics really matter for, say, a web server that mainly runs a website, with PHP-FPM, Nginx and MySQL.

Also, I'm looking into graphs: how do I read and analyze them to prevent future failures?

yvan

5 Answers

The metrics that matter are those which:

  • Indicate a problem with the correct and proper operation of the services you provide; or
  • Indicate the root cause of a problem

What metrics matter to you depends on what you judge, in your professional opinion, to be the metrics that best fulfil those two criteria. If you don't have the expertise to be able to accurately judge that in advance, well... yeah. Collecting more data that you may never need is better than not collecting some data which you turn out to need later. (The caveat there is that if your monitoring is starting to interfere with the efficient operation of the service, you might need to turn it down a bit, or optimise the statistics collection).

If you're looking for a short-cut answer, I'm afraid I don't have one -- you're on a steep learning curve that speaks to the very heart of what it means to be a sysadmin. If you're in a situation where some downtime doesn't matter, great! you've got yourself a great learning opportunity. If you're going to end up getting sued or going out of business if this service doesn't run perfectly, you might want to find someone with more experience to give you one-on-one guidance and mentoring.

womble
  • In my case, downtime isn't a problem, as it's almost impossible in the current structure for downtime to hurt me. With this question I'm trying to find out what really matters; I'm not sure that monitoring everything will help. I'm sure you have some key metrics that help you understand where you're going. – yvan Mar 05 '12 at 22:57
  • 1
    "as it almost not possible in the current structure to be down for me" <---- famous last words. – EEAA Mar 06 '12 at 04:51
  • @ErikA As nothing is in production yet, that's normal ;-) – yvan Mar 06 '12 at 07:09
  • @yvan: No, actually, I monitor practically everything. I've got a set of about 40 metrics that get measured on *every* machine I comprehensively administer -- and that's *before* I start looking at monitoring user-facing services themselves. – womble Mar 06 '12 at 07:51
  • @womble Well, can you tell me why these 40 metrics are important to you and how you use them? Because that's what I'm looking for. I just don't want to install a program that gives me a thousand metrics without understanding why I need them. – yvan Mar 06 '12 at 13:43

I just wrote and published a guide on exactly this subject:

Allow me to summarize here: There are 3 main goals to think about when monitoring any sort of production system:

  1. Identify as many problems as possible;
  2. Identify those problems as early as possible; and
  3. Generate as few false alarms as possible (that means setting proper alerts)

And you want to do this by picking your metrics under the following framework:

  1. Monitor Potential Bad Things (things that could go wrong - this is often in the form of things that fill up / run out -- i.e. memory, disk, bandwidth)
  2. Monitor Actual Bad Things (things that do go wrong despite your best efforts)
  3. Monitor Good Things (or the lack thereof - pay attention to things you want to happen and set an alert when they happen less frequently)
  4. Tune and Improve (otherwise you risk "alert fatigue" aka the DevOps equivalent of "crying wolf")

Every deployment is going to be a bit different so YMMV, but this is the framework that lots of seasoned pros use to think about things (whether explicit or not).
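
As a rough illustration, the four-part framework above could be sketched as a set of threshold checks evaluated on each polling cycle. The metric names and limits here are entirely hypothetical, chosen only to show one check per category:

```python
# Hypothetical sketch of the four-part framework as threshold checks.
# All metric names and limits are made-up examples, not recommendations.

def evaluate(metrics, state):
    """Return new alerts for one polling cycle, deduplicating repeats."""
    alerts = []
    # 1. Potential bad things: resources that fill up / run out
    if metrics["disk_used_pct"] > 90:
        alerts.append("disk nearly full")
    if metrics["mem_used_pct"] > 95:
        alerts.append("memory nearly exhausted")
    # 2. Actual bad things: errors already happening
    if metrics["http_5xx_per_min"] > 0:
        alerts.append("serving 5xx errors")
    # 3. Good things, or the lack thereof: expected events drying up
    if metrics["orders_per_hour"] < 1:
        alerts.append("no orders in the last hour")
    # 4. Tune and improve: suppress alerts already raised last cycle,
    #    so a persistent condition doesn't page you every minute
    new = [a for a in alerts if a not in state]
    state.clear()
    state.update(alerts)
    return new
```

The deduplication in step 4 is the simplest possible answer to alert fatigue; real systems usually add escalation and snoozing on top of it.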

[Edit for disclosure: I'm affiliated with Scalyr, a company that is involved in this space, and the link above is published on their site]

nlh

The most basic approach is to keep an eye on CPU load, free memory and swap, disk space, disk I/O, and network/bandwidth I/O. This can be done using tools like munin or collectd. Some people like to monitor a lot of things, but if you keep it simple you can at least get the overall picture. I also recommend that you configure the monitoring tools to send you email alerts when things start to go wrong (e.g. using "thresholds" or similar).
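
A minimal version of those threshold checks needs nothing beyond the standard library. This sketch checks two of the basics (1-minute load average and disk usage); the limits are arbitrary examples, `os.getloadavg` is Unix-only, and a real setup would email the result (e.g. via `smtplib`) rather than just return it:

```python
import os
import shutil

def basic_checks(path="/", load_limit=4.0, disk_limit_pct=90.0):
    """Return a list of threshold breaches; empty means all is well."""
    problems = []
    load1, _, _ = os.getloadavg()          # 1-minute load average (Unix only)
    if load1 > load_limit:
        problems.append(f"high load: {load1:.2f}")
    usage = shutil.disk_usage(path)        # disk space on the given mount
    used_pct = 100.0 * usage.used / usage.total
    if used_pct > disk_limit_pct:
        problems.append(f"disk {used_pct:.0f}% full")
    return problems
```

Tools like munin and collectd do the same thing more thoroughly and keep history, which is what makes graphing and trend analysis possible.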

Another very useful thing is to keep an eye on the most important server logs for anything unusual, e.g. error messages or perhaps even warnings. But such messages can be very common depending on how the various pieces of software are configured to log. Usually, daemons have a config file where you can change the "LogLevel" from error (=only log when something is broken) to debug (=log everything). Check which daemons you have running on your server, and change the log levels to error or warning. Then you can install a log file analysis tool such as OSSEC and train it to be silent when certain things are acceptable while alerting when things are broken. These alerts can be sent to you via email.
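
The core of the "train it to be silent" idea can be sketched in a few lines: match severity keywords, but skip lines matching a whitelist of known-acceptable messages. Both the severity pattern and the whitelist entry below are hypothetical examples, not OSSEC rules:

```python
import re

# Messages you have decided are acceptable noise (hypothetical example).
IGNORE = [re.compile(p) for p in (
    r"client closed connection",
)]

def scan(lines):
    """Return log lines that look like real problems."""
    hits = []
    for line in lines:
        # keep only lines carrying an error/warning severity keyword
        if not re.search(r"\b(error|warn(ing)?|crit)\b", line, re.I):
            continue
        # drop lines whitelisted as acceptable
        if any(p.search(line) for p in IGNORE):
            continue
        hits.append(line)
    return hits
```

Training the tool then amounts to growing the ignore list as you learn which messages are harmless in your environment.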

For your specific services, Nginx and MySQL, I recommend that you monitor their response time. This is good for two reasons: if you don't get a response at all, something is broken. And if you get a response with an unusually high response time - especially if it's not temporary but sustained over a period of, say, a few minutes or hours - then the service is struggling.
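
A response-time probe separates cleanly into two parts: timing and classifying one request, and the service-specific fetch. In this sketch the URL and the 2-second "slow" budget are illustrative; a MySQL probe would swap in a client call (e.g. a trivial `SELECT 1`) as the fetch:

```python
import time
import urllib.request

def classify(fetch, slow_after=2.0):
    """Time one call to `fetch` and classify the service's health."""
    start = time.monotonic()
    try:
        fetch()
    except OSError as exc:
        return ("down", str(exc))      # no response at all: broken
    elapsed = time.monotonic() - start
    if elapsed > slow_after:
        return ("slow", elapsed)       # responding, but struggling
    return ("ok", elapsed)

def probe_http(url, timeout=5.0, slow_after=2.0):
    """Probe a web server (e.g. the Nginx vhost) by fetching one byte."""
    return classify(
        lambda: urllib.request.urlopen(url, timeout=timeout).read(1),
        slow_after,
    )
```

Recording the elapsed time each run, not just the ok/slow/down verdict, is what lets you spot the sustained slowdowns mentioned above.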

öde

I would recommend you take a look at collectd. It can be configured to log numerous measurements into RRD-files for later analysis. It requires very little CPU and will help you to understand how your performance changes with load.

I have not found a truly awesome tool to actually draw graphs from the generated RRDs, but unless you want to project them in real time, just using rrdgraph on the command line is typically enough to periodically check for big changes.
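
If you end up running rrdgraph periodically, it can help to build the command from a script rather than retype it. In this sketch the RRD path and the data-source name (`load1`) are hypothetical, and rrdtool itself must be installed for the commented-out call to work:

```python
import subprocess

def rrdgraph_cmd(rrd, out_png, ds="load1", span="1d"):
    """Build an `rrdtool graph` command for one data source."""
    return [
        "rrdtool", "graph", out_png,
        "--start", f"-{span}",                 # e.g. the last day
        f"DEF:v={rrd}:{ds}:AVERAGE",           # read the data source
        "LINE1:v#0000ff:" + ds,                # draw it as a blue line
    ]

# subprocess.run(rrdgraph_cmd("load.rrd", "load.png"), check=True)
```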

Bittrance
  • Thanks, but I'm not looking for a tool, but more for best practices for monitoring. I would like to have a kind of list with must be monitored and why and how to understand what is going on – yvan Mar 05 '12 at 22:58
  • "It requires very little CPU" -- I took it off a bunch of servers a while ago because it was taking half a core. pnp4nagios is a far more lightweight option, and does a better job of presenting the graphs, too. – womble Mar 06 '12 at 07:53

Excellent advice above. But if you really just need to get started, watch the basics at first: CPU usage over time, memory usage over time, bandwidth usage and disk space use (or free disk space). Those four are very common because they pretty much define the capabilities of a computer.

Once you've monitored for a while and know what 'normal' is for a server, you'll be able to spot when something is abnormal. That's when you're ready to start digging deeper and find out the 'why' -- which will require additional more specific monitoring :)
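
One simple way to turn "know what normal is" into an automatic check is to keep a history of samples and flag values far from the observed baseline. The 3-sigma rule and the minimum-history cutoff below are common starting points, not universal recommendations:

```python
import statistics

def is_abnormal(history, value, sigmas=3.0):
    """Flag `value` if it sits far outside the observed baseline."""
    if len(history) < 10:          # not enough data to define "normal" yet
        return False
    mean = statistics.mean(history)
    dev = statistics.pstdev(history)
    # abnormal = more than `sigmas` standard deviations from the mean
    return dev > 0 and abs(value - mean) > sigmas * dev
```

This is exactly the "monitor for a while first" advice in code form: until the history is long enough, the check stays silent rather than guessing.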

DougN