
I have a small GlusterFS cluster with two storage servers providing a replicated volume. Each server has 2 SAS disks for the OS and logs, and 22 SATA disks for the actual data, striped together as a RAID10 using a MegaRAID SAS 9280-4i4e controller with this configuration: http://pastebin.com/2xj4401J

Connected to this cluster are a few other servers running the native client; they use nginx to serve the files stored on the volume, which are on the order of 3-10 MB each.

Right now a storage server has an outgoing bandwidth of 300 Mbit/s while the busy rate of the RAID array sits at 30-40%. There are also strange side effects: sometimes the I/O latency skyrockets and no access to the RAID is possible for more than 10 seconds. The file system is XFS, and it has been tuned to match the RAID stripe size.
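
For reference, a minimal sketch of such stripe alignment; the values below are assumptions (256 KB stripe unit, 11 data spindles for the 22-disk RAID10, device /dev/sdb), not the actual geometry from the pastebin config:

    # Illustrative only: align XFS to the RAID geometry at mkfs time.
    # su = controller stripe unit, sw = number of data spindles
    # (11 mirror pairs in a 22-disk RAID10). Device and sizes are
    # placeholders, not the real configuration.
    mkfs.xfs -d su=256k,sw=11 /dev/sdb

    # Verify what the file system actually uses (sunit/swidth are
    # reported in 512-byte sectors; /srv/gluster is a placeholder):
    xfs_info /srv/gluster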

Does anyone have an idea what could be the reason for such a badly performing array? 22 disks in a RAID10 should deliver far more throughput.

Philip
  • I suspect your client is calling sync() a lot, which forces the array to wait until it's committed the write and disables write caching. Do you have an NVRAM write cache anywhere? – pjc50 Apr 13 '12 at 09:33
  • Did you disable XFS's barriers? – pfo Apr 13 '12 at 09:39
  • @pjc50 For testing I've disabled all write access on the servers. This also happened when there were no writes at all on the RAID. – Philip Apr 13 '12 at 09:39
  • @pfo I did not, but it looks like I should do this. However, as stated in my previous comment, this also happens when there are no writes at all. :/ – Philip Apr 13 '12 at 09:45
  • Which kernel version are you running? – pfo Apr 13 '12 at 09:52
  • @pfo Just upgraded today to 3.2.13 with grsec. Before that I was running 2.6.38 also with grsec. It happened on both kernels. – Philip Apr 13 '12 at 09:56
  • 1
    How large is the RAM of the machine and what are the values for vm.{dirty_ratio,dirty_background_ratio,dirty_writeback_centisecs}? – pfo Apr 13 '12 at 10:28
  • Also use blktrace(8) to see if your CDBs contain a lot of FUA or SCSI_CACHE_SYNCHRONIZE commands sent to the device. – pfo Apr 13 '12 at 10:30
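
Following up on pfo's two comments, the checks might look like this (a sketch; /dev/sdb stands in for the actual array device):

    # Current writeback tunables pfo asked about:
    sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_writeback_centisecs

    # Trace the array device and watch for flush/FUA requests in the
    # RWBS column of the parsed output:
    blktrace -d /dev/sdb -o - | blkparse -i -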

2 Answers


Is someone shouting at your hard drives? :-)

More seriously: is there a lot of write activity during the I/O latency spikes? Have you tried iotop and/or btrace to see what's going on under the hood?
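
For example (a sketch; the device name is a placeholder):

    # Show only processes that are actually doing I/O, with per-process
    # read/write rates:
    iotop -o

    # Trace the block layer on the array device; btrace is a thin
    # wrapper around blktrace piped into blkparse:
    btrace /dev/sdb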

Perhaps the RAID controller flushes its cache during the spikes and blocks everything until the flush completes?
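
If the controller is the suspect, its cache policy and battery state are worth checking. A sketch using LSI's MegaCli utility (assuming it is installed; the binary may be named MegaCli64 depending on the package):

    # Cache policy of the logical drives (WriteBack vs WriteThrough,
    # and whether it falls back to WriteThrough without a healthy BBU):
    MegaCli -LDGetProp -Cache -LAll -aAll

    # Battery state; a failed or relearning BBU silently disables the
    # write-back cache on many setups:
    MegaCli -AdpBbuCmd -GetBbuStatus -aAll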

Janne Pikkarainen

If you can log a spike, we'd have more to work with. Either way, since there are no glaring configuration issues, this is probably a hardware problem. I'd start by replacing the card, and then maybe the disks if they're still under warranty.
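
One way to capture a spike for later analysis (a sketch; the log path and interval are arbitrary choices):

    # Log extended, timestamped device statistics once per second so a
    # spike can be matched against await/%util for each disk:
    iostat -xt 1 >> /var/log/iostat-array.log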

Basil