I am reading a large file sequentially from the disk and trying to understand the iostat output while the reading is taking place.

  • Size of the file : 10 GB
  • Read Buffer : 4 KB
  • Read ahead (/sys/block/sda/queue/read_ahead_kb) : 128 KB

The iostat output is as follows:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00  833.00   14.00   103.88     0.05   251.30     6.07    5.69    2.33  205.71   1.18 100.00

Computing the average I/O request size as rMB/s divided by r/s gives about 128 KB, which is the read ahead value. This suggests that although the read system call specifies a 4 KB buffer, the actual disk I/O is issued at the read ahead size.
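The arithmetic above can be sketched in a few lines. This is only an illustration using the numbers from the first iostat sample; the function names are made up for this example, and it assumes iostat's MB means 1024 KB and that `avgrq-sz` is reported in 512-byte sectors (reads and writes combined).

```python
SECTOR_BYTES = 512  # assumption: iostat reports avgrq-sz in 512-byte sectors


def avg_read_request_kb(r_mb_per_s, reads_per_s):
    """Average read request size in KB: read throughput divided by read rate."""
    return r_mb_per_s * 1024 / reads_per_s


def avgrq_sz_kb(avgrq_sz_sectors):
    """Convert avgrq-sz (sectors, reads and writes combined) to KB."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024


# Values copied from the sda line of the first iostat sample.
read_kb = avg_read_request_kb(103.88, 833.00)
mixed_kb = avgrq_sz_kb(251.30)
print(f"{read_kb:.2f} KB per read, {mixed_kb:.2f} KB per request overall")
```

Both figures land just under 128 KB; the `avgrq-sz` one is slightly lower because it also averages in the small write requests.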

When I increased the read ahead value to 256 KB, the iostat output was as follows:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    28.00  412.00   12.00   102.50     0.05   495.32    10.78   12.15    4.76  265.83   2.36 100.00

Again, the average I/O request size was 256 KB, matching the read ahead.

This held until I set the read ahead to 512 KB, but it broke down at 1024 KB: the average I/O request size was still 512 KB. Increasing max_sectors_kb (the maximum amount of data per I/O request) from its default of 512 KB to 1024 KB did not help either.
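When comparing these settings it can help to print them side by side. The following is a minimal sketch, assuming the device is `sda` and the usual sysfs layout under `/sys/block`; the helper name `read_queue_limits` is hypothetical. It also reads `max_hw_sectors_kb`, the hardware limit that `max_sectors_kb` cannot be raised above.

```python
from pathlib import Path

# Queue attributes (all in KB) that bound the size of a single read request.
QUEUE_FILES = ("read_ahead_kb", "max_sectors_kb", "max_hw_sectors_kb")


def read_queue_limits(device="sda", sysfs_root="/sys/block"):
    """Return the queue limits for a block device; None where a file is missing."""
    queue_dir = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in QUEUE_FILES:
        try:
            limits[name] = int((queue_dir / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None  # not present or not readable on this system
    return limits


if __name__ == "__main__":
    for name, value in read_queue_limits().items():
        print(f"{name}: {value}")
```

If `max_sectors_kb` is smaller than `read_ahead_kb`, the smaller value is the one to look at first when request sizes stop growing.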

Why is this happening? Ideally I would like to minimize read IOPS and read more data per I/O request (more than 512 KB per request). Additionally, I am hitting 100% disk utilization in all cases; I would like to throttle myself to 50-60% disk utilization while keeping good sequential throughput. In short, what are the optimal application and kernel settings for sequential read I/O?
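On the application side, one possible way to throttle is to cap the read rate in the reader itself. The sketch below is an assumption-laden illustration, not a recommended setting: the function name, the 1 MiB chunk size, and the 50 MiB/s cap are all made up for the example, and a byte-rate cap only indirectly influences the %util figure that iostat reports.

```python
import time


def throttled_read(path, chunk_size=1 << 20, max_bytes_per_sec=50 * (1 << 20)):
    """Read a file sequentially, sleeping as needed to stay under a byte-rate cap.

    Returns the total number of bytes read.
    """
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            total += len(chunk)
            # Earliest moment at which `total` bytes are allowed under the cap.
            allowed_at = start + total / max_bytes_per_sec
            delay = allowed_at - time.monotonic()
            if delay > 0:
                time.sleep(delay)
    return total
```

A call such as `throttled_read("/path/to/file", max_bytes_per_sec=60 * (1 << 20))` would hold the reader to roughly 60 MiB/s; the right cap for a given disk has to be found by watching iostat.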

Stormshadow
  • It's possible that the I/O scheduler has a say in this. According to the [RHEL performance tuning guide (see 5.3.6)](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/pdf/Performance_Tuning_Guide/Red_Hat_Enterprise_Linux-7-Performance_Tuning_Guide-en-US.pdf), `read_ahead_kb` works best with the noop scheduler. Noop does little scheduling of its own and hence reacts better to block layer parameters. To see the I/O scheduler currently in use, run `cat /sys/block//queue/scheduler`. – bytefire Oct 19 '16 at 08:48
  • Adaptive readahead will also affect how much data is read. As per [this](https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use), adaptive readahead can be turned off by passing the `POSIX_FADV_RANDOM` flag to the `posix_fadvise()` system call. – bytefire Oct 19 '16 at 09:31
  • @bytefire - you are spot on. Using the noop scheduler and increasing read_ahead_kb and max_sectors_kb, I was able to increase the average read request size to up to 3.2 MB (80 MB/s in 25 reads/s). – Stormshadow Oct 20 '16 at 05:23
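The two suggestions in the comments above can be sketched from the application side. This is a rough illustration with hypothetical helper names: `active_scheduler` parses the text of a `queue/scheduler` file, where the kernel marks the active scheduler with square brackets, and `set_readahead_hint` wraps Python's `os.posix_fadvise`. Per the Linux `posix_fadvise(2)` man page, `POSIX_FADV_SEQUENTIAL` doubles the readahead window and `POSIX_FADV_RANDOM` disables file readahead.

```python
import os
import re


def active_scheduler(scheduler_text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", scheduler_text)
    return match.group(1) if match else None


def set_readahead_hint(fd, sequential=True):
    """Hint the access pattern for the whole file (offset 0, length 0 = to end)."""
    advice = os.POSIX_FADV_SEQUENTIAL if sequential else os.POSIX_FADV_RANDOM
    os.posix_fadvise(fd, 0, 0, advice)


# Example scheduler file content; the bracketed entry is the active one.
print(active_scheduler("noop deadline [cfq]"))
```

In a reader, `set_readahead_hint(f.fileno())` would be called once right after opening the file and before the first read.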

1 Answer

Read ahead probably did not take effect in the 1024 KB case because the sector size of your hard disk is 512 KB. Please check your disk's sector size with `fdisk -l`. Even if you change the read ahead size and the max sectors parameter, a single I/O is still no larger than the hardware I/O size (the sector size).
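As an alternative to `fdisk -l`, the sizes the kernel reports can be read from sysfs. This is a minimal sketch, assuming the device is `sda`; the helper name is hypothetical. Note that these files report bytes, and typical values are 512 or 4096, which is worth comparing against the 512 KB figure above.

```python
from pathlib import Path


def block_sizes_bytes(device="sda", sysfs_root="/sys/block"):
    """Return the logical and physical block sizes in bytes; None if unavailable."""
    queue_dir = Path(sysfs_root) / device / "queue"
    sizes = {}
    for name in ("logical_block_size", "physical_block_size"):
        try:
            sizes[name] = int((queue_dir / name).read_text().strip())
        except (OSError, ValueError):
            sizes[name] = None  # file missing or unreadable on this system
    return sizes


if __name__ == "__main__":
    print(block_sizes_bytes())
```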

Jun Ge
  • That is interesting - is there any documentation regarding this? Also, would it not be beneficial for the disk to accept larger reads in order to prevent the scope of interrupting sequential reads with other random reads? – Stormshadow Oct 20 '16 at 04:30
  • _Conventionally_, Linux kernel treats sector size as 512 bytes and if the disk has different sector size, the low-level block device driver does necessary translation. – bytefire Oct 20 '16 at 10:56
  • A related question: What are the limits for `queue/read_ahead_kb`? During automatic testing I found that `0` is acceptable, `4` also is, but `1` is not. I read that the value has to be the device's block size at least, so for 512 bytes blocks 1k should be OK, right? – U. Windl Nov 27 '17 at 08:39