
We're observing poor file read IO results that we'd like to better understand. We can use fio to write 100 files with a sustained aggregate throughput of ~700MB/s. When we switch the test to read instead of write, the aggregate throughput is only ~55MB/s. The drop seems related to the number of files: read and write throughput are comparable for a single file, then diverge proportionally as we increase the number of files.
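For illustration, the invocation looks roughly like the following; the directory, sizes, and IO engine here are placeholders, not our exact job:

# Hypothetical reconstruction of the test; the real job options may differ.
# 100 concurrent sequential writers, one file each, buffered IO:
$ fio --name=manyfiles --directory=/mnt/raid --rw=write --bs=1M \
      --size=1G --numjobs=100 --ioengine=sync --group_reporting
# The read case swaps --rw=write for --rw=read over the same files.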

The test server has 24 CPU cores, 48GB of memory, and is running CentOS 6.0. The disk hardware is a 12-disk RAID 6 array behind a Dell H800 controller, formatted with ext4 using the default settings.

Increasing the readahead (using blockdev) improves the read throughput significantly, but it still doesn't match write speed. For instance, increasing the readahead from 128KB to 1MB improved the read throughput to ~145MB/s.
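blockdev sets read-ahead in units of 512-byte sectors, so that change looks like this (the device name is illustrative):

$ blockdev --getra /dev/sda       # 256 sectors = 128KB, the default
$ blockdev --setra 2048 /dev/sda  # 2048 sectors = 1MB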

Below are iostat results for the read case:

$ iostat -mx 2

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.15    4.06    0.00   95.73

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00  524.00    0.00    73.12     0.00   285.77    27.07   51.70   1.90  99.70

and write case:

$ iostat -mx 2

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.73    0.00    4.98    2.92    0.00   91.37

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00 195040.50    0.00 3613.00     0.00   776.79   440.32   137.23   37.88   0.28 100.00

One oddity is that rrqm/s is 0.00 for the read case, i.e., the block layer isn't merging any read requests.

Is this a known performance issue in our OS/disk/filesystem configuration? If so, how can we tell? If not, what tools or tests can we use to further isolate the issue?

Thanks.

bfallik-bamboom
  • Just a guess, but could it be that the read throughput is more heavily influenced by the drive's seek time, while write throughput isn't since it can just fragment the files? – Chris Nava Aug 29 '12 at 20:48
  • @ChrisNava - both reads and writes are sequential within each file. Is there a way to watch the seek activity to determine if your theory is correct? – bfallik-bamboom Aug 29 '12 at 21:03
  • Run "iostat -mx 2" while your test is running, see how many IOs there are and the utilization percentage of the device while reading, and please post an example of the output. – wazoox Aug 29 '12 at 21:25
  • too little memory available for caching and I/O buffers? – mdpc Aug 29 '12 at 21:49

1 Answer


This is definitely bound by head seeks: even though each file is read and written sequentially, working on many files simultaneously means the drive head has to jump between them all the time.

The iostat output clearly shows this pattern:

Most drives have an average seek time of 8-11ms; spread across a 12-drive array, that would get you at best around 1-2ms per request, which agrees with the 1.90ms svctm figure.

Thus, at ~2ms per read you get the ~500 reads/sec observed. If each read were 128KB, that would give ~64MB/sec. Bigger reads could get you much higher, but your iostat shows an avgrq-sz of only 285.77 sectors, i.e. ~143KB per read. Evidently, the IO scheduler has to keep the request size down so other reads don't wait too long. I'd guess you're using the deadline scheduler, since that is precisely its priority: not to make any process wait too long.
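To make the arithmetic explicit, from your read iostat line:

  1 / 1.90ms (svctm)                ≈ 526 IOs/sec  (matches r/s = 524)
  524 r/s × 285.77 sectors × 512 B  ≈ 73 MB/sec    (matches rMB/s = 73.12)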

The write performance stays high because, with enough RAM, the IO scheduler can aggregate enough data for each stream to make the access pattern closer to sequential. The avgrq-sz for writes is only about 1.5 times as big, but the avgqu-sz shows five times as many operations queued, which accounts for the roughly ten times better throughput.

Now, how to get better (more sequential-like) reads? The obvious way (and the only guaranteed one, IMHO) is to reduce the number of simultaneous files. You can also try other schedulers: I don't know whether cfq would favor bandwidth over latency; maybe noop would perform better, but it might make the rest of the system very unresponsive. Finally, each scheduler has several tunable parameters; you might play with those until you find your own ideal setting, as sketched below.
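For example, the scheduler and its tunables are exposed through sysfs; /dev/sda and the value below are just illustrations:

$ cat /sys/block/sda/queue/scheduler          # current scheduler is shown in brackets
$ echo noop > /sys/block/sda/queue/scheduler  # try another scheduler (as root)
# While deadline is active, its tunables are under iosched/, e.g. read expiry in ms:
$ echo 1000 > /sys/block/sda/queue/iosched/read_expire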

Javier