
I have a Rackspace cloud server running Ubuntu with 2 GB of memory that is being used as an application server (the HTML & PHP files are served from this server, and the MySQL database is on another server in the same datacenter).

When the number of users of my webapp increases (10,000+/day), the load goes up to 1.00 and sometimes 2.00. This makes sense logically, but I cannot find where the bottleneck is. Using the `top` command, I see that CPU usage is near 1% almost all of the time, and only about 500 MB of the 2 GB of memory is used (almost all of it by Apache processes). I also have Munin installed, and it shows that these numbers hold roughly steady for the entire day (there are no major spikes in either statistic).

If it is not CPU or memory that is the problem, then what should I monitor and/or optimize to prepare for larger traffic? (I don't know what to improve, since I don't know the cause of the load!)

Thanks! Please let me know if you need any other info about my server setup.

eric

3 Answers

2

"Load" comes from more than just cpu utilization. It's the number of processes that are waiting for resources.

The first thing you need to do is figure out whether this is having any impact on the application you're serving. A load of less than the number of CPUs you have is normally considered good.

When you're seeing this, what does `top` say about your iowait?
What does `free -m` show?

You may also want to have a look at `iostat`.
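
For example, something like the following shows all three in one pass (just a sketch; adjust the interval and sample count, and `iostat` needs the sysstat package installed):

# One batch iteration of top; the "wa" value on the Cpu(s) line is iowait
top -bn1 | grep "Cpu(s)"

# Memory and swap usage in MB
free -m

# Extended per-device I/O statistics: 5-second samples, 3 of them
iostat -x 5 3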

3dinfluence
  • @3dinfluence: Thanks for the answer. I'm afraid of my server crashing once more users visit (which could be very soon). `free -m` shows 400 MB used, 1600 MB free. `top` shows a load of 1.2. I'm not sure what you mean about what top says about iowait. Do you mean the number of tasks? top says that there are 82 total tasks, 1 running (presumably the top command itself), and 81 sleeping. I installed iostat earlier, but it is very hard to decipher. Are there specific numbers I should be looking for in iostat? – eric Dec 03 '10 at 01:21
  • Next line down in top will show what you need. `Cpu(s): 5.2%us, 2.9%sy, 0.0%ni, 91.6%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st` us = user processes, sy = system processes, ni = nice, id = idle, wa = IO Wait, hi = hard interrupt, si = soft interrupt, st = steal time – 3dinfluence Dec 03 '10 at 01:53
  • My guess is your bottleneck is disk io associated with mysql. If iostat or top show iowait then I would try iotop. This will tell you what processes are using disk io. If that turns out to be the case then you may be able to tune mysql or optimize your application/database to be more efficient. Other options would include memcached and other caching approaches. But how to optimize things depends on your application. – 3dinfluence Dec 03 '10 at 01:57
  • This is what it says: `Cpu(s): 0.3%us, 0.2%sy, 0.0%ni, 75.9%id, 23.6%wa, 0.0%hi, 0.0%si, 0.0%st`. Actually, I was monitoring top for 5 minutes, and it appears that most of the time the %id is 99%, and every once in a while the %id drops to 75% like in the statistics I gave. Does that mean the cause is disk IO? I installed iotop, and the only things that cause disk writes are something called "kjournald" and also "apache2 -k start" every 10 seconds or so, with a disk write speed of 60 KB/s during those writes (normally the speed is 0 B/s). (FYI, I use memcached and APC on the server.) – eric Dec 03 '10 at 02:14
  • None of that sounds abnormal to me. Not much disk IO happening, at least not at the moment. It just looks like normal buffers being flushed. – 3dinfluence Dec 03 '10 at 02:39
  • Thanks again for your help so far. OK, so CPU, memory, and disk IO seem to have been ruled out as the cause of the high load? Hmm, what else could cause the problem? Do network bandwidth limits ever cause high load? – eric Dec 03 '10 at 03:00
  • I'm not sure I would call anything ruled out at this point. Your iowait seems to be elevated for some reason, but that doesn't always mean that you have a problem. I think you should be looking more at metrics that directly affect your application, like average response time from Apache or MySQL query times. – 3dinfluence Dec 03 '10 at 04:08
  • Thanks. The webpages load quickly when the server load is less than 2.0, so it is difficult to determine exactly what the problem is using webpage loading times or MySQL query times. I was hoping there was more of a Linux-y way to find the problem. It seems like the only time I'll be able to diagnose the problem is when the traffic starts increasing and things start going really wrong :\ – eric Dec 03 '10 at 04:56
  • BTW, I wish I could give you like 40 votes for your help so far, but I don't have enough points to vote :\ – eric Dec 03 '10 at 04:59
  • Well, you could do load testing on your application; there are a lot of frameworks available to do this sort of thing (see the sketch just below for one way). That way you don't have to wait for the traffic to arrive. I would do this against a different virtual machine, though, and not your production one. It should be pretty easy to clone your production machine in the cloud and have a development virtual machine to pound on. Since the charges are per hour and bandwidth used, you won't have to make any long-term commitment to running multiple servers, and the cost should be pretty low. – 3dinfluence Dec 03 '10 at 18:46
2

Processes can be in one of several states in the Linux scheduler. Newer kernels have some fancy ones, but the basics are (from include/linux/sched.h):

#define TASK_RUNNING            0
#define TASK_INTERRUPTIBLE      1
#define TASK_UNINTERRUPTIBLE    2
#define TASK_STOPPED            4

The first should be obvious; the last is tasks that have actually been halted. The interruptible state is for tasks that are sleeping. Uninterruptible tasks are usually waiting on a system resource -- like disk or other IO.

Presumably because uninterruptible tasks are usually expected to be scheduled very soon, they're counted as being in the run queue.

And the loadavg numbers you see in /proc/loadavg (and in top and other tools) are simply the average size of that run queue -- the processes waiting to be scheduled -- over 1-, 5-, and 15-minute intervals. If you've got a lot of processes actually in TASK_RUNNING, that'll drive up the loadavg, but processes stuck in TASK_UNINTERRUPTIBLE will do it too. (In fact, in my experience, that's usually the culprit behind ridiculously high load numbers.)
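
You can see both halves of this for yourself (a quick sketch; `D` is how `ps` reports the uninterruptible state):

# The three load averages, plus runnable/total tasks and the last PID used
cat /proc/loadavg

# List any processes currently stuck in uninterruptible sleep (state "D")
ps -eo state,pid,comm | awk '$1 == "D"'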

So, if you're seeing high load without much CPU usage, you want to look for IO. `iotop` is a handy tool for this, though it requires kernel 2.6.20 or newer. On older systems, or just for an alternate view, `iostat` (from the sysstat package) and `vmstat` (from procps) can show some general statistics. Alternatively, if you're using NFS, a stuck process may actually be doing very little real IO but still get jammed up. (Yay NFS.)
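
For example (a sketch; `iotop` has to run as root, and the exact flags vary slightly between versions):

# Only processes actually doing I/O, batch mode, 3 samples
iotop -o -b -n 3

# In vmstat, the "b" column counts tasks in uninterruptible sleep,
# and the "wa" column (under cpu) is the iowait percentage
vmstat 5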

If you're not seeing any of that, something may be going awry in the virtual machine infrastructure.

mattdm
  • Thanks for the answer, but I'm confused about how I can use it to help diagnose my problem! Should I be monitoring something? – eric Dec 03 '10 at 02:17
  • @eric: added more re that. :) – mattdm Dec 03 '10 at 02:35
  • I looked at iotop (see the comments on @3dinfluence's answer), but it doesn't seem to suggest any answers. Any other ideas? I'll contact Rackspace and see what they say. – eric Dec 03 '10 at 05:01
  • One data point you can collect is the very simple `hdparm` benchmark. Run `hdparm -tT /dev/sda` (or whatever your drive is, or on all of them if you have more than one). Run it several times and disregard the first couple. This needs to be run as root but is nondestructive. (Beware though that hdparm does have some dangerous options.) – mattdm Dec 03 '10 at 13:51
0

Monitor the number of disk I/O operations and the sizes of those operations.

This will tell you:

  1. Where the bottlenecks are (e.g. many read I/O operations per second, or a few very large writes), which in turn tells you
  2. What changes you should make to improve performance (e.g. moving to a RAID 10 array, or switching to SSDs).

I'm not sure what control you have over the disk configuration in your environment, but it does seem like that's your bottleneck.
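
As a rough starting point, extended iostat output shows both the operation rates and the request sizes per device (a sketch; it needs the sysstat package, and column names differ a little between versions):

# Extended per-device stats every 5 seconds:
#   r/s, w/s     = read/write operations per second
#   rkB/s, wkB/s = data transferred per second
#   avgrq-sz     = average request size (in sectors)
#   %util        = how busy the device is
iostat -dxk 5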

Slartibartfast