
I have a small GlusterFS cluster with two storage servers providing a replicated volume. Each server has 2 SAS disks for the OS and logs, and 22 SATA disks for the actual data, striped together as a RAID10 using a MegaRAID SAS 9280-4i4e controller with this configuration: http://pastebin.com/2xj4401J

Connected to this cluster are a few other servers running the native client; they use nginx to serve the files stored on the volume, which are on the order of 3-10 MB each.

Right now a storage server has an outgoing bandwidth of 300 Mbit/s while the busy rate of the RAID array sits at 30-40%. There are also strange side effects: sometimes the I/O latency skyrockets and no access to the RAID is possible for more than 10 seconds. The file system is XFS, and it has been tuned to match the RAID stripe size.
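
For reference, a minimal sketch of such stripe alignment; the values below are assumptions (256 KB stripe unit, 11 data spindles for the 22-disk RAID10, device /dev/sdb), not the actual geometry from the pastebin config:

    # Illustrative only: align XFS to the RAID geometry at mkfs time.
    # su = controller stripe unit, sw = number of data spindles
    # (11 mirror pairs in a 22-disk RAID10). Device and sizes are
    # placeholders, not the real configuration.
    mkfs.xfs -d su=256k,sw=11 /dev/sdb

    # Verify what the file system actually uses (sunit/swidth are
    # reported in 512-byte sectors; /srv/gluster is a placeholder):
    xfs_info /srv/gluster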

Does anyone have an idea what could be the reason for such a badly performing array? 22 disks in a RAID10 should deliver far more throughput.

Philip
  • I suspect your client is calling sync() a lot, which forces the array to wait until it's committed the write and disables write caching. Do you have an NVRAM write cache anywhere? – pjc50 Apr 13 '12 at 09:33
  • Did you disable XFS's barriers? – pfo Apr 13 '12 at 09:39
  • @pjc50 For testing I've disabled all write access on the servers. This also happened when there were no writes at all on the RAID. – Philip Apr 13 '12 at 09:39
  • @pfo I did not, but it looks like I should do this. However, as stated in my previous comment, this also happens when there are no writes at all. :/ – Philip Apr 13 '12 at 09:45
  • Which kernel version are you running? – pfo Apr 13 '12 at 09:52
  • @pfo Just upgraded today to 3.2.13 with grsec. Before that I was running 2.6.38 also with grsec. It happened on both kernels. – Philip Apr 13 '12 at 09:56
  • 1
    How large is the RAM of the machine and what are the values for vm.{dirty_ratio,dirty_background_ratio,dirty_writeback_centisecs}? – pfo Apr 13 '12 at 10:28
  • Also use blktrace(8) to see if your CDBs contain a lot of FUA or SCSI_CACHE_SYNCHRONIZE commands sent to the device. – pfo Apr 13 '12 at 10:30
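
Following up on pfo's two comments, the checks might look like this (a sketch; /dev/sdb stands in for the actual array device):

    # Current writeback tunables pfo asked about:
    sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_writeback_centisecs

    # Trace the array device and watch for flush/FUA requests in the
    # RWBS column of the parsed output:
    blktrace -d /dev/sdb -o - | blkparse -i -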

2 Answers


Is someone shouting at your hard drives? :-)

More seriously: is there a lot of write activity during the I/O latency spikes? Have you tried iotop and/or btrace to see what's going on under the hood?
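
For example (a sketch; the device name is a placeholder):

    # Show only processes that are actually doing I/O, with per-process
    # read/write rates:
    iotop -o

    # Trace the block layer on the array device; btrace is a thin
    # wrapper around blktrace piped into blkparse:
    btrace /dev/sdb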

Perhaps the RAID controller flushes its cache during the spikes and blocks everything until the flush completes?
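
If the controller is the suspect, its cache policy and battery state are worth checking. A sketch using LSI's MegaCli utility (assuming it is installed; the binary may be named MegaCli64 depending on the package):

    # Cache policy of the logical drives (WriteBack vs WriteThrough,
    # and whether it falls back to WriteThrough without a healthy BBU):
    MegaCli -LDGetProp -Cache -LAll -aAll

    # Battery state; a failed or relearning BBU silently disables the
    # write-back cache on many setups:
    MegaCli -AdpBbuCmd -GetBbuStatus -aAll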

Janne Pikkarainen

If you can log a spike, we'd have more to work with. Either way, since there are no glaring configuration issues, this is probably a hardware problem. I'd start by replacing the card, and then maybe the disks if they're still under warranty.
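
One way to capture a spike for later analysis (a sketch; the log path and interval are arbitrary choices):

    # Log extended, timestamped device statistics once per second so a
    # spike can be matched against await/%util for each disk:
    iostat -xt 1 >> /var/log/iostat-array.log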

Basil