
Some time (or some amount of data written) after a server reboot, the write speed drops to less than 1 MB/s. This happens regardless of the filesystem (also on a raw partition) and regardless of whether the target is an HDD (HW RAID) or an SSD (SW RAID, with the SSDs connected to the motherboard AHCI ports, not to the RAID board). We are testing with dd if=/dev/zero of=tst bs=1M oflag=dsync (I have also tried bs=1k and running without dsync, but performance was no better).
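
For reference, a sketch of the exact variants tested (the count values here are arbitrary, just to bound the run):

dd if=/dev/zero of=tst bs=1M count=1024 oflag=dsync     # main test case, drops below 1 MB/s
dd if=/dev/zero of=tst bs=1k count=1048576 oflag=dsync  # smaller block size, no better
dd if=/dev/zero of=tst bs=1M count=1024                 # without dsync, also no better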

The only weird thing I noticed was that avgrq-sz is only 8 in the iostat output (on other tested servers it was more than 600) and that the write rate is about 100 req/s (also on the SSD). Running several dd processes in parallel gave each of them 1 MB/s and about 100 req/s, as in the sketch below.
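
The parallel test was run along these lines (a sketch; the file names and count are arbitrary):

for i in $(seq 1 8); do
    dd if=/dev/zero of=tst$i bs=1M count=1024 oflag=dsync &
done
iostat -xN sdc 1
wait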

Sample iostat -xN 1 output:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    0.00  125.00     0.00   500.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00  124.00     0.00   496.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     3.00    0.00  128.00     0.00   524.00     8.19     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     6.00    0.00  124.00     0.00   728.00    11.74     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00  125.00     0.00   500.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
sdc               0.00     3.00    0.00  128.00     0.00   524.00     8.19     0.00    0.00    0.00    0.00   0.00   0.00

Iostat output with 8x dd running:

sdc               0.00    64.00    0.00  959.00     0.00  7560.00    15.77     0.00    0.00    0.00    0.00   0.00   0.00

The lsblk -O output is consistent with other servers that do not have this problem (e.g. MIN-IO, RQ-SIZE, LOG-SEC). The current kernel is 4.9.16-gentoo, but the issue started with an older kernel. Running dd with oflag=direct is fast.
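
A quick way to compare the relevant queue-related columns between servers (a sketch; this particular column selection is just a suggestion, the original comparison used the full lsblk -O output):

lsblk -o NAME,MIN-IO,OPT-IO,PHY-SEC,LOG-SEC,RQ-SIZE,SCHED /dev/sdc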

EDIT: Based on shodanshok's answer I now see that the requests are indeed small, but the question is why the I/O scheduler does not merge them into larger requests. I have tried both the cfq and deadline schedulers. Is there anything I can check (or compare against other servers)?
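
The scheduler switching and merge check were done along these lines (a sketch, run as root; the sysfs paths are the ones listed further below):

cat /sys/block/sdc/queue/scheduler               # e.g. "noop [deadline] cfq"
echo cfq      > /sys/block/sdc/queue/scheduler
echo deadline > /sys/block/sdc/queue/scheduler
cat /sys/block/sdc/queue/nomerges                # 0 means request merging is allowed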

Output when running with oflag=direct (speed is OK):

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00    2.00  649.00     8.00 141312.00   434.16     2.78    4.26    2.00    4.27   1.53  99.60
sdc               0.00     0.00    0.00  681.00     0.00 143088.00   420.23     2.71    3.99    0.00    3.99   1.46  99.20
sdc               0.00     0.00    2.00  693.00     8.00 146160.00   420.63     2.58    3.71    0.00    3.72   1.42  98.80
sdc               0.00    49.00    2.00  710.00     8.00 146928.00   412.74     2.68    3.76   22.00    3.71   1.39  99.20
sdc               0.00     0.00    1.00  642.00     4.00 136696.00   425.19     2.43    3.79   60.00    3.71   1.42  91.60

The server is a Dell PowerEdge R330 with 32 GB RAM and an Intel E3-1270 CPU; the HDDs are behind an LSI MegaRAID 3108 controller and the SSDs are connected to the onboard SATA ports. The filesystem is ext3, but the same happens with dd to a raw partition.

Output of lsblk (sdc is HW RAID HDD, sda/sdb SW RAID SSDs):

NAME    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdb       8:16   0 223.6G  0 disk
|-sdb4    8:20   0     1K  0 part
|-sdb2    8:18   0   140G  0 part
| `-md1   9:1    0   140G  0 raid1 /
|-sdb5    8:21   0    25G  0 part
| `-md2   9:2    0    25G  0 raid1 /var/log
|-sdb3    8:19   0     1G  0 part
|-sdb1    8:17   0    70M  0 part
| `-md0   9:0    0    70M  0 raid1 /boot
`-sdb6    8:22   0  57.5G  0 part
  `-md3   9:3    0  57.5G  0 raid1 /tmp
sr0      11:0    1  1024M  0 rom
sdc       8:32   0   3.7T  0 disk
`-sdc1    8:33   0   3.7T  0 part  /home
sda       8:0    0 223.6G  0 disk
|-sda4    8:4    0     1K  0 part
|-sda2    8:2    0   140G  0 part
| `-md1   9:1    0   140G  0 raid1 /
|-sda5    8:5    0    25G  0 part
| `-md2   9:2    0    25G  0 raid1 /var/log
|-sda3    8:3    0     1G  0 part
|-sda1    8:1    0    70M  0 part
| `-md0   9:0    0    70M  0 raid1 /boot
`-sda6    8:6    0  57.5G  0 part
  `-md3   9:3    0  57.5G  0 raid1 /tmp

With oflag=direct the speed is OK, but the issue is that applications don't use direct I/O, so even a plain cp is slow.

/sys/block/sdc/queue/hw_sector_size : 512
/sys/block/sdc/queue/max_segment_size : 65536
/sys/block/sdc/queue/physical_block_size : 512
/sys/block/sdc/queue/discard_max_bytes : 0
/sys/block/sdc/queue/rotational : 1
/sys/block/sdc/queue/iosched/fifo_batch : 16
/sys/block/sdc/queue/iosched/read_expire : 500
/sys/block/sdc/queue/iosched/writes_starved : 2
/sys/block/sdc/queue/iosched/write_expire : 5000
/sys/block/sdc/queue/iosched/front_merges : 1
/sys/block/sdc/queue/write_same_max_bytes : 0
/sys/block/sdc/queue/max_sectors_kb : 256
/sys/block/sdc/queue/discard_zeroes_data : 0
/sys/block/sdc/queue/read_ahead_kb : 128
/sys/block/sdc/queue/discard_max_hw_bytes : 0
/sys/block/sdc/queue/nomerges : 0
/sys/block/sdc/queue/max_segments : 64
/sys/block/sdc/queue/rq_affinity : 1
/sys/block/sdc/queue/iostats : 1
/sys/block/sdc/queue/dax : 0
/sys/block/sdc/queue/minimum_io_size : 512
/sys/block/sdc/queue/io_poll : 0
/sys/block/sdc/queue/max_hw_sectors_kb : 256
/sys/block/sdc/queue/add_random : 1
/sys/block/sdc/queue/optimal_io_size : 0
/sys/block/sdc/queue/nr_requests : 128
/sys/block/sdc/queue/scheduler : noop [deadline] cfq
/sys/block/sdc/queue/discard_granularity : 0
/sys/block/sdc/queue/logical_block_size : 512
/sys/block/sdc/queue/max_integrity_segments : 0
/sys/block/sdc/queue/write_cache : write through
Marki555

1 Answer


The request size (avgrq-sz) is small because you are issuing small write requests. Your dd command, albeit specifying a 1 MB block size, is hitting the pagecache, so each 1 MB request really is a collection of 256 * 4 KB requests. This is reflected by avgrq-sz which, being expressed in 512-byte units, perfectly aligns with the 4 KB page size. Moreover, even SSDs can have bad performance with synchronized writes, as requested by oflag=dsync. That said, please note that the I/O scheduler should merge these small 4 KB requests into bigger ones, but this is not happening.
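
To make the arithmetic explicit: avgrq-sz is in 512-byte sectors, so 8 * 512 B = 4096 B, exactly one page, and a 1 MiB buffered write is split into 1 MiB / 4 KiB = 256 page-sized requests. A quick way to see the contrast (a sketch; the file name tst and device sdc are taken from the question):

# In one terminal, watch the device:
iostat -xN sdc 1
# In another terminal, compare the two write modes:
dd if=/dev/zero of=tst bs=1M count=256 oflag=dsync    # avgrq-sz should stay around 8 (4 KiB)
dd if=/dev/zero of=tst bs=1M count=256 oflag=direct   # avgrq-sz should jump to several hundred sectors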

Some things to check (a combined sketch follows the list):

  • what do you see when issuing cat /sys/block/sdc/queue/scheduler? If noop is the selected scheduler, try switching to deadline
  • is your /sys/block/sdc/queue/max_sectors_kb at least 1024?
  • try executing dd if=/dev/zero of=tst bs=1M oflag=direct: I/O performance should be much higher, right?
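
Something along these lines (a sketch, run as root; note that the echo into max_sectors_kb will be rejected if the value exceeds max_hw_sectors_kb, which is 256 on this device):

cat /sys/block/sdc/queue/scheduler                    # switch to deadline if noop is active
echo deadline > /sys/block/sdc/queue/scheduler
cat /sys/block/sdc/queue/max_sectors_kb               # ideally at least 1024
echo 1024 > /sys/block/sdc/queue/max_sectors_kb       # fails if above max_hw_sectors_kb
dd if=/dev/zero of=tst bs=1M count=256 oflag=direct   # should be much faster than with dsync
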
shodanshok
  • Scheduler was `cfq`, but I tried `deadline` as well. `max_sectors_kb` is 256 and yes, using `direct` the speed is much better. But why is it fast with `dsync` right after a server reboot? Was the scheduler merging the requests then, but not anymore? Also I need to check with a colleague which application is really slow, as it may be issuing requests similar to what `dsync` produces. – Marki555 May 13 '17 at 22:55
  • But `dd` without `oflag` is now slow too... so the issue really is that the I/O scheduler is not merging as it should? – Marki555 May 13 '17 at 22:57
  • There is also a `nomerges` flag file, which should default to 0 (doc: https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt) – eckes May 15 '17 at 17:36
  • Yes, `nomerges` is 0. However I now tried on a good server and `iostat` is not reporting any merges (`wrqm/s` is 0) and `avgrq-sz` is >500. So maybe the pagecache is not dividing the requests into 4kB chunks there? – Marki555 May 15 '17 at 18:23
  • Can you post `iostat -xN 1` output when running `dd` with `bs=1M oflag=direct`? Can you add some hardware details? What filesystem are you using? – shodanshok May 15 '17 at 19:40
  • @shodanshok added info to question. – Marki555 May 15 '17 at 20:01
  • The added `iostat` data collected with `oflag=direct` shows a rather large (~200 KB) request size and a reasonable throughput (~140 MB/s). So, with `oflag=direct`, does the slowness disappear? Why is `sdc` repeated so many times in the `iostat` output? Can you show the output of `lsblk`? – shodanshok May 16 '17 at 08:36
  • Each line in the `iostat -xN 1` output is from a different second; I just omitted the other devices. I plan to try `blktrace`, but I don't know whether it will show anything interesting. `lsblk` output added. – Marki555 May 16 '17 at 13:22
  • For further debug, please post the output of the following command: `find /sys/block/sdc/queue/ -type f -printf "%h/%f : " -exec cat '{}' ';'` – shodanshok May 18 '17 at 13:04
  • Added. It seems normal to me (compared to other good servers I have). – Marki555 May 18 '17 at 14:53