3

We have 5 Toshiba PX04SRB192 SSDs (270K random read IOPS per the spec sheet), set up in hardware RAID 5. Running fio gives about 250K IOPS, which is well below what I was expecting.

/fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/ae/disk1/test --bs=4k --iodepth=96 --numjobs=1 --size=8g --readwrite=randread

test: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=96
fio-2.0.9
Starting 1 process
Jobs: 1 (f=1): [r] [100.0% done] [961.6M/0K /s] [246K/0  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=342604: Tue Feb 12 23:58:01 2019
  read : io=8192.0MB, bw=991796KB/s, iops=247948 , runt=  8458msec
  cpu          : usr=10.88%, sys=87.74%, ctx=437, majf=0, minf=115
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued    : total=r=2097152/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=8192.0MB, aggrb=991795KB/s, minb=991795KB/s, maxb=991795KB/s, mint=8458msec, maxt=8458msec

Disk stats (read/write):
  sdb: ios=2083688/0, merge=0/0, ticks=265238/0, in_queue=265020, util=98.53

lspci -nn | grep RAID
18:00.0 RAID bus controller [0104]: LSI Logic / Symbios Logic MegaRAID Tri-Mode SAS3508 [1000:0016] (rev 01)

I was expecting the IOPS of 5 SSDs to be at least twice that of an individual SSD. Is that a reasonable expectation? Any suggestions on why we are seeing such low IOPS?

  • 2
    I suspect the RAID controller is the bottleneck. Can you get its perf stats? Otherwise, start with a single drive and see if you can even get 270K iops – Mark Wagner Feb 14 '19 at 00:58

2 Answers

1

First, you expect 270k IOPS as per the spec, but what block size is required to reach that figure? Is it actually a 4k I/O size in the spec?

Second, if you use a single I/O thread to benchmark your RAID 5, you will never see the array's full read performance: each read I/O is served by only one SSD. So you have to increase the worker count (fio's numjobs parameter) to at least 5.
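
For example (a sketch based on the command in the question; the per-job depth and sizes are just starting points to tune, not recommended values), raising numjobs spawns independent workers that each keep their own I/Os outstanding, and group_reporting prints one aggregated result:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=/ae/disk1/test --bs=4k --iodepth=32 --numjobs=5 --size=8g --readwrite=randread --group_reporting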

Chaoxiang N
  • 1
    While I agree that increasing `numjobs` may help reach higher "hero" numbers, I'm not sure the statement "Each read io is served by only 1 SSD drive" covers the entire situation. The question author is using `libaio` with `direct=1` and `iodepth=96`. That means there can be up to 96 outstanding I/Os at any given time split across all the disks (i.e. we have asynchronous submission from a single thread/process). If each SSD had an internal queue depth of only 19, or the RAID controller's maximum depth is only 96, then what was done would already be at the limit... – Anon Feb 14 '19 at 06:31
  • Yes, 270K random reads at 4k. I tried increasing numjobs but IOPS don't go beyond ~250K – user1959200 Feb 14 '19 at 17:03
1

(Your version of fio is ancient! See https://github.com/axboe/fio/releases to see what upstream has reached...)

The feedback you're getting in other answers is good but I'd like to highlight this:

  cpu          : usr=10.88%, sys=87.74%, ctx=437, majf=0, minf=115

If we sum your userspace and kernel system percentages together we get 98.62%, which strongly suggests you have no CPU time left to send more I/Os (note you're already using the go-faster stripe of gtod_reduce=1, which I normally recommend against, but it looks appropriate in your case).
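
One way to check whether a single CPU is what's saturating is to watch per-CPU utilisation while the job runs (mpstat comes with the sysstat package; the one-second interval is just an example):

mpstat -P ALL 1

If the CPU that fio is running on sits near 100% usr+sys while the others are mostly idle, the single submission process is the bottleneck rather than the disks.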

There are a few other things though...

  sdb: ios=2083688/0, merge=0/0, ticks=265238/0, in_queue=265020, util=98.53

This is hinting that the "disk" your RAID controller is presenting is very busy (look at that util percentage). That is something to bear in mind.

Are you doing I/O through a file within a filesystem (/ae/disk1/)? If so, are you aware that the filesystem will impose some overhead and may not offer the O_DIRECT behaviour you're expecting? You probably want to start by doing I/O at the block level (i.e. /dev/sdb) and work your way up, so you can attribute where any overhead comes from (WARNING: be careful - fio can destroy data when misused).
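
A block-level run might look like the following sketch (assuming /dev/sdb is the volume your controller presents, as in the disk stats above; --readonly adds a safety check so fio refuses to write, and the 60 second runtime is arbitrary):

fio --name=rawread --ioengine=libaio --direct=1 --readonly --filename=/dev/sdb --bs=4k --iodepth=96 --numjobs=1 --time_based --runtime=60 --readwrite=randread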

If you're really going to go faster I think you will need to:

  • Do I/O at the block device level.
  • Use multiple threads or processes (e.g. by increasing numjobs). That way the fio threads/processes are likely to migrate to different CPUs (but note everything comes with a cost)...
  • Start tweaking fio to submit and reap I/O in batches (see the sketch after this list).
  • Start tweaking your kernel.
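
A batching sketch (the batch sizes here are guesses to experiment with, not recommendations; iodepth_batch_submit and iodepth_batch_complete control how many I/Os fio submits and reaps per system call):

fio --name=batched --ioengine=libaio --direct=1 --readonly --filename=/dev/sdb --bs=4k --iodepth=96 --iodepth_batch_submit=16 --iodepth_batch_complete=16 --numjobs=4 --group_reporting --time_based --runtime=60 --readwrite=randread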

As I stated, it is rare that people need to go to these lengths, but maybe you're one of the exceptions :-). An fio mailing list reply, "Re: Recommended Job File For Stress Testing PCI-E", mentions this:

You might see a benefit (more load per disk) using threads (thread) rather than processes. You may also need to use more than one thread per disk. See http://fio.readthedocs.io/en/latest/fio_doc.html for more options. Both https://www.spinics.net/lists/fio/msg05451.html and http://marc.info/?l=linux-kernel&m=140313968523237&w=2 give examples of people using fio to drive high load so those might be more useful.
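
Following that suggestion, using threads plus several jobs against the same device might look like this sketch (the job count and depth are illustrative, not tuned values):

fio --name=threaded --thread --numjobs=8 --ioengine=libaio --direct=1 --readonly --filename=/dev/sdb --bs=4k --iodepth=32 --group_reporting --time_based --runtime=60 --readwrite=randread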

Anon