I'm not sure whether this would be better titled "Why would Nagios need to monitor a load reaching 30".

Situation: I am setting up Nagios for our network and have reached the stage of setting up NRPE on the *nix boxes. I had already (on paper) gotten a rough idea of where I wanted notifications set up. For a particular server, as an example, it looks like this:

1 minute: warn at 90%, crit at 100%
5 minutes: warn at 80%, crit at 90%
15 minutes: warn at 60%, crit at 70%

The server runs two virtual CPUs, so I plan to use the -r parameter to get a per-CPU result (yes, I know this isn't really per CPU; it's the load across all of them divided by the number of them, and I am OK with that).

So I was absolutely ready to set this up, when I saw the defaults in the NRPE config file:

command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20

This put me off. I started wondering whether I really understand load averages. I can see that the -r parameter is not used there, so load averages above 1 are normal, but does that suggest the default is meant for a 30-CPU system? I saw this question, where the answer suggests using [number of CPUs] * 10 for the 5-minute critical notification (or was it the 1-minute one?), which further supports using values far higher than I had planned. I mean, without seeing those defaults I would have gone with

command[check_load]=/usr/lib/nagios/plugins/check_load -r -w 0.9,0.8,0.6 -c 1.0,0.9,0.7

but now I am doubtful. I know that no one on the internet can tell me the correct values for our situation, and I do not expect anyone to, but I would be very thankful if someone could tell me whether or not I grossly misunderstand load and need to start my detective work on useful values again. For what it is worth, I got those values just from having run top every once in a while for the past 6 months on the server in question. It usually sits between 0.4 per CPU (0.8) and 0.55 per CPU (1.1) for the 1-minute average.
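
For reference, the per-CPU figure that -r would report can be spot-checked by hand; this is just the 1-minute field of /proc/loadavg divided by the CPU count (a quick sketch, nothing clever):

awk -v cpus=$(nproc) '{ printf "1-min load per CPU: %.2f\n", $1 / cpus }' /proc/loadavg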

Chris O'Kelly
  • There are fault conditions that can cause far, far higher load averages... IIRC ca. 880 on a single-CPU, single-core system is the highest I have personally seen happen in a non-test environment. To replicate in the lab: spawn a few thousand processes that will halt on a D-stated resource, then make the D state go away all at once (easy to do with hard-mounted NFS :) – rackandboneman Jan 14 '13 at 07:18

1 Answer


The raw load average numbers are just numbers, not percentages of some absolute maximum. Load average and CPU utilization (which is usually expressed as a percentage) are not the same thing; you should monitor both.

An approximate description of load average (on Linux at least) is "the number of processes that could run"; beyond that it is very dependent on what your systems do. The rule of thumb is that 1 load unit per CPU is "busy", which is what the check_load -r parameter accounts for. High I/O and short-lived processes can really mess that up. You can find better descriptions elsewhere.
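
If you want to see where those numbers come from, the kernel exposes them directly in /proc/loadavg; the fourth field (runnable entities / total entities) gives a hint of what is feeding the average at that instant. A quick look (the output line here is made up for illustration; field meanings per proc(5)):

cat /proc/loadavg
# 0.84 0.52 0.47 2/341 12345
# 1-min, 5-min, 15-min averages; runnable/total scheduling entities; PID of the newest process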

To answer your question: a load of 30 could be caused by 30 processes or threads that are all ready to run, driving your CPUs flat out with no sleeps/polls.
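
If you are curious, that is easy to reproduce on a scratch box (please not on anything you care about); a rough sketch that drives the 1-minute average up toward 30:

for i in $(seq 1 30); do yes > /dev/null & done   # 30 CPU-bound processes, all permanently runnable
sleep 60; cat /proc/loadavg                       # the 1-minute figure climbs toward ~30
kill $(jobs -p)                                   # clean up the busy loops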

Good job running top and getting a feel for your load; those are the numbers you should start with, then tune them over time to minimise false alerts, though I would suggest doubling your critical thresholds.
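
For your two-CPU box that might land somewhere around the following (illustrative only: your warnings kept, your criticals doubled, then tune from real alerts):

command[check_load]=/usr/lib/nagios/plugins/check_load -r -w 0.9,0.8,0.6 -c 2.0,1.8,1.4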

IMHO the nrpe.cfg sample values are too high for a typical server workload. My guess is that they are set high enough not to cause a constant stream of "NRPE tells me my load average is too high all the time" questions. Oddly, check_load itself has defaults of 0,0,0 and 0,0,0.

mr.spuratic