Environment:
Intel Server Board S2600GZ
2 x Intel Xeon CPU E5-2620
128GB DDR3 RAM
Intel RAID Controller RS25DB080 (LSI SAS2208) with four ST2000NM0033-9ZM175 SATA disks
Ubuntu 12.04.5 LTS / Linux 3.11.0-26-generic x86_64
We have a 4TB hardware RAID10 volume built on the aforementioned controller, with Ubuntu Server installed on it. This server is a "hot standby" under a minor load (a moderately active GlusterFS replica brick and a few backup KVM/qemu VMs).
When the disk load increases (some VMs take over the primary role or get restarted, or GlusterFS volume activity increases), we sometimes get a burst of CPU system time and high load average values. Neither htop nor iotop reveals the culprit. The irq and softirq values are normal. Usually we try to decrease the disk load, and eventually the CPU system time slowly returns to normal, but only until it all happens again.
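For reference, here is a minimal sketch of how the CPU time breakdown mentioned above (system, iowait, irq, softirq) could be sampled directly from /proc/stat, independently of htop. The function names and the one-second interval are illustrative assumptions, not part of our actual monitoring setup.

```python
import os
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of jiffies."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def breakdown(before, after):
    """Return each field's share (in percent) of the jiffies elapsed between two samples."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which holds the totals for all CPUs."""
    with open(path) as stat_file:
        return stat_file.readline()


if __name__ == "__main__":
    if os.path.exists("/proc/stat"):
        first = parse_cpu_line(read_cpu_line())
        time.sleep(1.0)  # assumed sampling interval
        second = parse_cpu_line(read_cpu_line())
        for name, pct in breakdown(first, second).items():
            print("%-8s %5.1f%%" % (name, pct))
    else:
        print("/proc/stat not found; this sketch only works on Linux")
```

During a burst, a high "system" share together with low "irq" and "softirq" shares would match what we observe.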
We suspect the storage subsystem, but can't figure out what exactly is faulty. MegaCli -PDList -aALL reports no problems with the disks, MegaCli -AdpEventLog -GetSinceReboot -f lsi-events.log -aALL reports no typical errors, and the volume state is always optimal. smartctl also reports no S.M.A.R.T. issues with any of the hard disks. The situation has kept recurring for more than six months, and none of the reports described above has changed: all systems appear to be healthy.
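To make the "no problems with disks" check repeatable, here is a minimal sketch of how the per-disk error counters could be extracted from saved MegaCli -PDList -aALL output. It assumes that each disk record starts with an "Enclosure Device ID" line and contains the "Slot Number", "Media Error Count", "Other Error Count" and "Predictive Failure Count" fields. The function names and the saved-output file are hypothetical.

```python
import sys

# Counters that are assumed to appear in every disk record of the PDList output.
COUNTERS = ("Media Error Count", "Other Error Count", "Predictive Failure Count")


def parse_pdlist(text):
    """Split MegaCli -PDList output into one dict of 'field: value' pairs per disk."""
    disks = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "field: value" line
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":  # assumed to open each disk record
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


def suspicious_disks(disks):
    """Return (slot, {counter: value}) for every disk with a non-zero error counter."""
    flagged = []
    for disk in disks:
        nonzero = {}
        for counter in COUNTERS:
            value = disk.get(counter, "")
            if value.isdigit() and int(value) > 0:
                nonzero[counter] = int(value)
        if nonzero:
            flagged.append((disk.get("Slot Number", "?"), nonzero))
    return flagged


if __name__ == "__main__":
    if len(sys.argv) == 2:
        # Expects a file with previously saved PDList output.
        with open(sys.argv[1]) as saved_output:
            for slot, counters in suspicious_disks(parse_pdlist(saved_output.read())):
                print("slot %s: %s" % (slot, counters))
    else:
        print("usage: pdlist_check.py <saved PDList output>")
```

In our case such a check would print nothing, since all counters are reported as zero.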
So, here are the questions. Is there any chance that the described troubles are caused by a faulty RAID controller? Or is it more likely that one of the disks is dying and both its S.M.A.R.T. subsystem and the controller firmware mysteriously can't detect it? How could we identify the disk in the latter case? Or how could we confirm that it's the controller's fault, so that replacing it would be warranted? Any other suggestions?