
My server is experiencing very high load average spikes (>10, sometimes even >20) every few minutes.

top shows that the CPUs aren't busy, but are waiting on I/O:

top - 17:42:28 up 8 days,  8:10,  1 user,  load average: 9.01, 10.16, 6.54
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,  0.0%id,100.0%wa,  0.0%hi,  0.0%si,  0.0%st

dmesg shows this output over and over again (I don't understand what it means):

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata2.00: BMDMA stat 0x26
ata2.00: cmd ca/00:08:74:c4:24/00:00:00:00:00/ef tag 0 dma 4096 out
         res 51/84:01:7b:c4:24/84:00:10:00:00/ef Emask 0x30 (host bus error)
ata2.00: status: { DRDY ERR }
ata2.00: error: { ICRC ABRT }
ata2: soft resetting link
ata2.00: configured for UDMA/33
ata2: EH complete
sd 3:0:0:0: [sdb] 625142448 512-byte hardware sectors (320073 MB)
sd 3:0:0:0: [sdb] Write Protect is off
sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Any ideas?

There is FreeRADIUS running on the server, and I suspect that either RADIUS or the network adapter might be causing the problem. During some spikes tcpdump showed an increased number of RADIUS packets being sent/received (but I'm talking about tens of packets per minute, not thousands).

When I stop RADIUS the situation gets better, but there are still periodic load average spikes (more tolerable, though).

Does anybody have an idea what might be causing this behavior, and how I can determine for sure whether it is RADIUS, the network adapter, or something else?

Thanks

celicni

2 Answers


No, this is either a disk or a disk controller dying. It has nothing to do with the software you're running or the network adapter.
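One way to see that the errors point at a single disk is to count the recurring ICRC/ABRT error lines per ATA port. A minimal sketch (the sample `dmesg` lines are hard-coded from the log above; on a real server you would feed it the full `dmesg` output, and the ata2-to-sdb mapping should be confirmed, e.g. by inspecting the symlinks under /sys/block/sdb/):

```python
import re
from collections import Counter

# Sample lines copied from the dmesg output in the question; in practice,
# pipe the full `dmesg` output into this script instead.
dmesg = """\
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata2.00: error: { ICRC ABRT }
ata2: soft resetting link
ata2.00: error: { ICRC ABRT }
"""

# Count "error:" lines per ATA port; errors clustering on one port
# implicate the disk (or cable/controller channel) behind it.
errors = Counter(
    m.group(1)
    for line in dmesg.splitlines()
    if (m := re.match(r"(ata\d+)(?:\.\d+)?: error:", line))
)
print(errors)  # e.g. Counter({'ata2': 2})
```

If one port dominates the count while the other stays clean, the fault is almost certainly on that link rather than in any userspace software.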

If you're not already doing backups, make one NOW and start looking for replacement hardware.

pauska
  • There are two disks in RAID1 and both disks have been changed recently (because of failure). They seem to have passed the SMART test. Raw_Read_Error_Rate is in "pre-fail", however I've tested it on some other servers and it was in "pre-fail" as well, so I'm not sure what it means. – celicni Dec 09 '11 at 18:12
  • It turned out /dev/sdb was failing and that caused the problem. Thanks for the answer! – celicni Dec 14 '11 at 13:36

An almost identical question has been posted on SU.

Before you reboot or tinker with settings, perform a backup (and verify it!) ASAP.
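The verify step matters because a dying disk can silently corrupt what you copy off it. A toy sketch of copy-then-verify (paths and file contents are made up for illustration; on a real server you would rsync to a different machine and checksum there):

```shell
# Hypothetical example: create a throwaway source and backup directory.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "important data" > "$src/file.txt"

# 1. Copy the data (in production: rsync/dump to ANOTHER machine,
#    not a second partition on the failing disk).
cp -r "$src/." "$dst/"

# 2. Compare byte-for-byte before trusting the copy.
cmp "$src/file.txt" "$dst/file.txt" && echo "backup verified"
```

`cmp` rereads both copies, so a read error or silent corruption on the sick disk shows up now, while the original still (mostly) exists, rather than after you've wiped it.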

thinice