I have a 30 TB hardware RAID-6 array (LSI 9280-8e) of 10 Intel DC S4500 SSDs that is used for database purposes. The OS is Debian 7.11 with a 3.2 kernel. The filesystem is XFS, mounted with the nobarrier option.
Because random I/O performance was somewhat sluggish compared to my expectations, I started investigating by running fio benchmarks. To my surprise, when I ran fio against a 1 TB file in a random-read configuration (iodepth=32, ioengine=libaio), I got only ~3000 IOPS, which is much lower than what I was expecting.
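For reference, the job was launched with roughly the following command (a sketch: the file path is a placeholder, and the 4k block size is inferred from the output below rather than copied from the exact invocation):

fio --name=random-read --filename=/path/to/testfile --size=1T --rw=randread \
    --bs=4k --ioengine=libaio --iodepth=32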
random-read: (groupid=0, jobs=1): err= 0: pid=128531
read : io=233364KB, bw=19149KB/s, iops=4787 , runt= 12187msec
...
cpu : usr=1.94%, sys=5.81%, ctx=58484, majf=0, minf=53
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.9%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=58341/w=0/d=0, short=r=0/w=0/d=0
However, if I use the direct=1 option (i.e. bypassing Linux's buffer cache), I get ~40000 IOPS, which is what I'd like to see.
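That is the same job with only direct I/O enabled, i.e. roughly (same placeholders as above):

fio --name=random-read --filename=/path/to/testfile --size=1T --rw=randread \
    --bs=4k --ioengine=libaio --iodepth=32 --direct=1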
random-read: (groupid=0, jobs=1): err= 0: pid=130252
read : io=2063.7MB, bw=182028KB/s, iops=45507 , runt= 11609msec
....
cpu : usr=6.93%, sys=23.29%, ctx=56503, majf=0, minf=54
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
issued : total=r=528291/w=0/d=0, short=r=0/w=0/d=0
I seem to have all the right settings on the SSD block device: the I/O scheduler, read-ahead and the rotational flag.
root@XX:~# cat /sys/block/sdd/queue/scheduler
[noop] deadline cfq
root@XX:~# cat /sys/block/sdd/queue/rotational
0
root@XX:~# blockdev --getra /dev/sdd
0
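As a cross-check on the read-ahead value: blockdev --getra reports read-ahead in 512-byte sectors, so the 0 above should correspond to a 0 in the per-queue sysfs attribute as well:

cat /sys/block/sdd/queue/read_ahead_kb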
Am I still missing something that lowers the buffered performance so much? Or is it expected to see such a difference between direct and buffered I/O?
I also looked at the iostat output during the two runs. This is when direct=1 was used:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdd 0.00 0.00 48110.00 0.00 192544.00 0.00 8.00 27.83 0.58 0.58 0.00 0.02 99.60
This is the buffered run:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdd 0.00 0.00 4863.00 0.00 19780.00 0.00 8.13 0.89 0.18 0.18 0.00 0.18 85.60
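Both snapshots show extended device statistics; I captured them with roughly the following while the respective fio job was running (the interval and exact flags may have differed):

iostat -x 1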
So it looks like the key difference is the queue size (avgqu-sz), which is small when using buffered I/O. I find that weird, given that nr_requests and queue_depth are both high:
root@XX:~# cat /sys/block/sdd/queue/nr_requests
128
root@XX:~# cat /sys/block/sda/device/queue_depth
256
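If it helps, the number of requests in flight can also be sampled directly from sysfs while a job runs; a quick loop like this (assuming the device is sdd) should roughly mirror the avgqu-sz values that iostat reports:

while true; do cat /sys/block/sdd/inflight; sleep 1; done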
Any advice here?