High Iowait + High load average on a monitoring server

Question

I have a Nagios server that was working perfectly until a few days ago. I stopped it to increase its RAM and restarted it, and since then iowait has increased dramatically on the server (more than 20%, where it was less than 1% before). I tried putting back the original amount of RAM, but I still get the same issue.
I've read lots of similar iowait questions on Server Fault, but I haven't managed to find the explanation for my case:
Looking at iotop, I see a lot of I/O for pdflush, which flushes the page cache, and kjournald, which handles journaling for the ext3 filesystem. I don't know if that's normal. Following other Server Fault questions, I've tried adding noatime in fstab. The ext3 filesystem is mounted in ordered data mode.

Total DISK READ: 0.00 B/s | Total DISK WRITE: 210.44 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  650 be/3 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [kjournald]
11482 be/4 root        0.00 B/s    0.00 B/s  0.00 % 98.42 % [pdflush]
12167 be/4 nagios      0.00 B/s    0.00 B/s  0.00 %  0.12 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
   11 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.10 % [migration/3]
12168 be/4 nagios      0.00 B/s    0.00 B/s  0.02 %  0.08 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
12165 be/4 nagios      0.00 B/s    0.00 B/s 98.42 %  0.02 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
 2600 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.02 % auditd
12164 be/4 nagios      0.00 B/s    0.00 B/s  0.00 %  0.00 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
    8 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   20 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
   26 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/0]
   23 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
 3047 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % snmpd -Ln -Lf /dev/null -p /var/run/snmpd.pid -a
12169 be/4 nagios      0.00 B/s    0.00 B/s  0.12 %  0.00 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
   14 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
 2601 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % auditd
    5 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
   17 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
 5228 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % bash
   10 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   13 rt/3 root        0.00 B/s    0.00 B/s  0.10 %  0.00 % [watchdog/3]

The following line

 12165 be/4 nagios      0.00 B/s    0.00 B/s 98.42 %  0.02 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg

seems quite surprising: how can SWAPIN be at 98.42% when almost no swap is in use:

free -o
             total       used       free     shared    buffers     cached
Mem:       4046468    3163796     882672          0     103548    2193604
Swap:      4192956       1572    4191384
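One way to cross-check whether the machine is really swapping is to look at the kernel's cumulative swap counters instead of a per-process percentage. The sketch below assumes a Linux /proc/vmstat containing pswpin and pswpout lines; the function name swap_pages is illustrative and not part of any tool mentioned here.

```shell
#!/bin/sh
# Print the cumulative number of pages swapped in and out.
# Assumption: the file has "pswpin N" and "pswpout N" lines, as /proc/vmstat does on Linux.
swap_pages() {
    file="${1:-/proc/vmstat}"
    awk '$1 == "pswpin"  { in_pages = $2 }
         $1 == "pswpout" { out_pages = $2 }
         END { printf "pswpin=%d pswpout=%d\n", in_pages, out_pages }' "$file"
}
```

Running swap_pages twice a few seconds apart and comparing the two lines shows whether any pages were swapped during that interval; identical counters suggest no real swap traffic.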

top doesn't show anything specific, except high load and high iowait:

top - 10:07:56 up 12 days, 23:42,  4 users,  load average: 8.60, 9.29, 9.85
Tasks: 177 total,   1 running, 176 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.0%sy,  0.0%ni, 77.2%id, 22.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4046468k total,  3165500k used,   880968k free,   104204k buffers
Swap:  4192956k total,     1572k used,  4191384k free,  2201500k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 5246 root      15   0 14252 2632  836 R  0.3  0.1   0:03.94 top                
    1 root      15   0 10372  696  584 S  0.0  0.0   0:03.61 init               
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:14.80 migration/0        
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.73 ksoftirqd/0        
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0         
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:13.93 migration/1        
    6 root      34  19     0    0    0 S  0.0  0.0   0:01.75 ksoftirqd/1        
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1         
    8 root      RT  -5     0    0    0 S  0.0  0.0   0:09.51 migration/2        
    9 root      34  19     0    0    0 S  0.0  0.0   0:01.09 ksoftirqd/2        
   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2         
   11 root      RT  -5     0    0    0 S  0.0  0.0   0:08.98 migration/3        
   12 root      34  19     0    0    0 S  0.0  0.0   0:01.46 ksoftirqd/3        
   13 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/3         
   14 root      RT  -5     0    0    0 S  0.0  0.0   0:20.36 migration/4        
   15 root      34  19     0    0    0 S  0.0  0.0   0:01.15 ksoftirqd/4        
   16 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/4
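For reference, an iowait percentage like the one in the Cpu(s) line can be approximated from two samples of the aggregate "cpu" line in /proc/stat. This is a minimal sketch, assuming the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); iowait_pct is a hypothetical helper name, and top's own calculation may differ in detail.

```shell
#!/bin/sh
# Compute the iowait share (in percent) between two "cpu ..." lines from /proc/stat.
# Assumption: field 6 is iowait and fields 2-9 together make up total CPU time.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total[NR] = 0; for (i = 2; i <= 9; i++) total[NR] += $i; iow[NR] = $6 }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (iow[2] - iow[1]) / dt
        }'
}

# Example use (5-second sampling window chosen arbitrarily):
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

Sampling over a longer window smooths out short bursts, which helps when comparing the figure with a "before" baseline.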

Stopping the nagios process brings the system load back to normal (i.e. < 1), but I still get high iowait.

In atop, DSK is 100% busy, even with no nagios process running. Could I have a hard drive problem? (It's a Western Digital Green, which isn't meant to run in a server like this.) I see nothing unusual in dmesg or syslog.
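A disk "busy" percentage of this kind can be approximated by hand from /proc/diskstats. The sketch below assumes the standard Linux layout, where field 13 of a device's line is the number of milliseconds spent doing I/O; the helper names io_ticks and busy_pct are made up for illustration, and atop's exact method may differ.

```shell
#!/bin/sh
# Read the "milliseconds spent doing I/O" counter for one device.
# Assumption: field 3 is the device name and field 13 is that counter.
io_ticks() {
    awk -v dev="$1" '$3 == dev { print $13 }' "${2:-/proc/diskstats}"
}

# Percentage of an interval (given in seconds) during which the device was busy.
# (after - before) is in milliseconds, so divide by seconds * 1000 and scale by 100.
busy_pct() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / (s * 10) }'
}

# Example use (device name sda is an assumption):
#   t1=$(io_ticks sda); sleep 5; t2=$(io_ticks sda); busy_pct "$t1" "$t2" 5
```

A value near 100 while very little data is actually being written points at the device itself being slow to complete requests, not at the volume of I/O.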

How many disks? Just one? – Tom O'Connor Jul 15 '13 at 10:03

score 2 · Accepted Answer · answered Jul 15 '13 at 10:06

Oh, I'm sorry. Are you using a WD Green disk for something other than a desktop PC?

Don't.

They're slow, unreliable (they'll go to sleep and drop out of a RAID array), and totally unsuitable for what you want to do.

If you're experiencing high IOWait, that means the disk subsystem isn't able to handle the amount of disk IO that's required.

The easy way to resolve that is to add more disks (Ideally a whole bunch in a RAID6 array).

You should also check general disk health with smartctl, and take a backup (you should do this regularly anyway, but if you've got an over-used WD Green, I'd be extra cautious).
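As a rough illustration of the smartctl check, the sketch below pulls the raw value of a single attribute out of the attribute table that smartctl -A prints. It assumes the usual ten-column table with the raw value in the last column; smart_raw is a hypothetical helper, and the device path in the usage comment is only a placeholder.

```shell
#!/bin/sh
# Print the raw value of one SMART attribute from "smartctl -A" output on stdin.
# Assumption: column 2 is the attribute name and column 10 is the raw value.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Example use (replace /dev/sdX with the actual device):
#   smartctl -H /dev/sdX                                  # overall health verdict
#   smartctl -A /dev/sdX | smart_raw Load_Cycle_Count     # head parking counter
```

Recording the counter once a day makes it easy to see whether it is climbing unusually fast.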

Tom O'Connor

Backups are done every day, so I'm quite confident about this. My problem is that I didn't notice anything wrong with smartctl: no errors, and the load cycle count isn't climbing without limit. I've already scheduled replacing this WD Green drive, but I'd like to be sure that it is the **only** origin of the problem. (FYI, there is no RAID on the server.) – Golgot Jul 15 '13 at 10:33
Ever tried a restore? ;) – Tom O'Connor Jul 15 '13 at 11:12
There's never one origin of a problem. There's always something that's not quite perfect, but things rapidly snowball. – Tom O'Connor Jul 15 '13 at 11:12
It was clearly the hard drive: the system has now been migrated to a new server, and there are no more issues. On the old server, I did some testing: writing to a file increased the iowait for hours (even long after the write finished), and then it slowly went back down... – Golgot Jul 30 '13 at 13:09

score 0 · Answer 2 · answered Jul 15 '13 at 08:57

Use the swapoff and swapon commands to clear the swap. After that, stop nagios and check whether any of its processes are still running with ps -ef|grep nagios, then start nagios again.

The command below shows which partition holds the swap:

swapon -s

swapoff /dev/sdaN

swapon /dev/sdaN
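To confirm how much swap is actually in use before and after cycling it, the same information that swapon -s prints can be read from /proc/swaps. This is a small sketch assuming the usual column order there (Filename, Type, Size, Used, Priority); swap_usage is an illustrative name.

```shell
#!/bin/sh
# List each swap area with its used and total size in KiB.
# Assumption: line 1 is a header, and columns 1, 3, 4 are name, size, used.
swap_usage() {
    file="${1:-/proc/swaps}"
    awk 'NR > 1 { printf "%s used=%dK of %dK\n", $1, $4, $3 }' "$file"
}
```

If the used figure is already close to zero, cycling swap is unlikely to change the iowait picture.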

Sharad Chhetri

I disabled swap, without any success: I still get the same iowait and I still see 99% swapin (on a bash process). I don't know how that is possible. – Golgot Jul 15 '13 at 09:13
Stop nagios and check with iotop, then let us know what happens. What other services are running on this server? Are there any NFS mounts on the server? – Sharad Chhetri Jul 15 '13 at 15:53