High CPU system time of unknown nature

Question

Environment:
Intel Server Board S2600GZ
2 x Intel Xeon CPU E5-2620
128GB DDR3 RAM
Intel RAID Controller RS25DB080 (LSI SAS2208) with four ST2000NM0033-9ZM175 SATA disks
Ubuntu 12.04.5 LTS / Linux 3.11.0-26-generic x86_64

We have a 4TB hardware RAID10 volume built on the aforementioned controller, with Ubuntu Server installed on it. This server is a "hot standby" under a minor load (a moderately active GlusterFS replica brick and a few backup KVM/qemu VMs).

When the disk load increases (some VMs take over the primary role or get restarted, or GlusterFS volume activity increases), we sometimes get a burst of CPU system time and high load average values. Neither htop nor iotop reveals the culprit. irq and softirq values are normal. Usually we try to decrease the disk load, and eventually the CPU system time slowly returns to normal, but only until it all happens again.
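For reference, here is a minimal sketch of how the per-CPU breakdown during a burst could be sampled straight from /proc/stat, to check whether the time is really "system" rather than iowait, irq or softirq. The function names (parse_proc_stat, shares, sample_shares) are hypothetical helpers, not part of any tool mentioned above, and the sketch assumes the standard Linux /proc/stat field order.

```python
import time

# First eight per-CPU counters in /proc/stat, in kernel order (units: USER_HZ ticks).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Return {cpu_name: {field: ticks}} for the 'cpu' and 'cpuN' lines."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu"):
            continue  # skip intr, ctxt, btime, etc.
        values = [int(v) for v in parts[1:1 + len(FIELDS)]]
        result[parts[0]] = dict(zip(FIELDS, values))
    return result


def shares(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    out = {}
    for cpu, now in after.items():
        prev = before.get(cpu)
        if prev is None:
            continue
        delta = dict((f, now[f] - prev[f]) for f in now)
        total = sum(delta.values())
        if total > 0:
            out[cpu] = dict((f, 100.0 * d / total) for f, d in delta.items())
    return out


def sample_shares(interval=5):
    """Hypothetical helper: read /proc/stat twice, `interval` seconds apart."""
    with open("/proc/stat") as fh:
        first = parse_proc_stat(fh.read())
    time.sleep(interval)
    with open("/proc/stat") as fh:
        second = parse_proc_stat(fh.read())
    return shares(first, second)
```

Calling sample_shares(5) during a burst and comparing the "system" and "iowait" percentages per CPU would show whether the load is spread over all cores or pinned to a few of them.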

We suspect the storage subsystem, but can't figure out what exactly is faulty. MegaCli -PDList -aALL reports no problems with the disks, MegaCli -AdpEventLog -GetSinceReboot -f lsi-events.log -aALL reports no typical errors, and the volume state is always optimal. smartctl also reports no S.M.A.R.T. issues with any of the hard disks. The situation has kept recurring for more than six months, and none of the reports described above has changed: all systems appear to be healthy.
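Since the controller and S.M.A.R.T. reports stay clean, one more data point that could be collected is the average I/O wait of the RAID volume as the kernel sees it. Below is a rough sketch that derives it from two snapshots of /proc/diskstats. The helper names are hypothetical, and the device name ("sda" in the usage note) is an assumption. Note that behind a hardware RAID controller the OS sees only the logical volume, so this measures the volume as a whole and cannot single out an individual disk.

```python
def parse_diskstats(text):
    """Return {device: (reads, ms_reading, writes, ms_writing)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        p = line.split()
        if len(p) < 11:
            continue  # not a full per-device statistics line
        # p[2]=name, p[3]=reads completed, p[6]=ms spent reading,
        # p[7]=writes completed, p[10]=ms spent writing
        stats[p[2]] = (int(p[3]), int(p[6]), int(p[7]), int(p[10]))
    return stats


def avg_wait_ms(before, after, device):
    """Average milliseconds per completed I/O between two snapshots, or None if idle."""
    r0, rms0, w0, wms0 = before[device]
    r1, rms1, w1, wms1 = after[device]
    ios = (r1 - r0) + (w1 - w0)
    if ios == 0:
        return None
    return float((rms1 - rms0) + (wms1 - wms0)) / ios
```

Taking one snapshot before and one during a burst and calling avg_wait_ms(before, after, "sda") would show whether request latency on the volume rises together with the CPU system time.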

So, here are the questions. Is there any chance that the described troubles could be caused by a faulty RAID controller? Or is it more likely that one of the disks is dying and both its S.M.A.R.T. subsystem and the controller firmware mysteriously can't detect it? How could we identify the disk in the latter case? Or how could we confirm it's the controller's fault, so that replacing it would be warranted? Any other suggestions?

score 1 · Answer 1 · answered Nov 20 '15 at 10:19

Really????

I had the same problem 2 years ago on 2 servers, so I didn't trust the internal RAID controller for this, and after one week I chose to scrap the setup and reinstall both servers using software RAID (you are always safe). After 2 years there have been no problems; they work perfectly. Of course my customer spent a lot of money for nothing, but I didn't agree with him about the choice from the beginning, as I usually work with other HW vendors.

Take a look:

dmidecode -t 2

SMBIOS 2.6 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: Intel Corporation
Product Name: S2600GZ
Version: G11481-354
Serial Number: QSGR34501185
Asset Tag: ....................
Features:
    Board is a hosting board
    Board is replaceable
Location In Chassis: To be filled by O.E.M.
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0

Yes, our output for the mobo mostly matches, but we use an _external_ RAID controller. Also, we can't afford to reinstall everything, since the GlusterFS volume is too large to re-sync from scratch. — Jacob Becker, Nov 20 '15 at 10:33
Of course you can't do that. My scenario is a bare-metal node connected to external storage, but it's strange that the same thing happens twice in different parts of the world. Maybe it's a bus problem? Try asking Intel if you can. — Francesco P, Nov 20 '15 at 11:10
