
To pre-warm an ext4 EBS volume I'm using fio as follows:

fio --name=<filename> --filename=<filename> --rw=read --direct=1 --ioengine=libaio --bs=<X>k

and I'm trying to understand what the optimal block size should be. I know I can `stat` a file to get its block size, but when I use that value in fio, it throws an error if the file is smaller than the block size.

One option would be to use the block size reported by stat by default and, when the file size is smaller than that, fall back to the closest 'standard' value: e.g. if the size is less than 4k, set the block size to 1024.

What's the best way of setting the right block size?

EDIT: I'm restoring a 10TB gp2 volume from a snapshot. There are a few million files: most of them are small, but a sizeable portion ranges from 50MB to 30GB, and all of these files need to be "ready" to be read as fast as possible. I've got a script that runs fio against each file (a sketch is below) and I'm trying to understand how best to dynamically adjust the block size for each one.
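
Something like this sketch is what I have in mind (the job name, the halving loop, and the 1024-byte floor are arbitrary choices of mine):

    #!/bin/bash
    # Sketch: pre-warm each file passed as an argument, shrinking the
    # block size whenever a file is smaller than its stat-reported block size.
    for f in "$@"; do
        size=$(stat -c '%s' "$f")   # file size in bytes
        bs=$(stat -c '%o' "$f")     # optimal I/O transfer size hint
        # halve bs until it fits within the file, with an arbitrary 1024-byte floor
        while [ "$bs" -gt "$size" ] && [ "$bs" -gt 1024 ]; do
            bs=$((bs / 2))
        done
        fio --name=prewarm --filename="$f" --rw=read --direct=1 \
            --ioengine=libaio --bs="$bs"
    done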

XCore
  • Why aren't you using the block size they gave in the linked document? Though, I don't really expect it to matter all that much. – Michael Hampton Sep 24 '19 at 10:42
  • The document suggests prewarming the whole volume, but that's unnecessary, as you only want to prewarm the blocks that are actually in use (hence files vs block devices). Using a block size bigger than the file will make fio throw an error. – XCore Sep 24 '19 at 10:48
  • It also says you only need to prewarm when you have restored an EBS volume from a snapshot. That said, if you only want to prewarm a single file, then you'd probably have to use the minimum block size as you've already proposed. – Michael Hampton Sep 24 '19 at 11:22

1 Answer


In order to mask network latency, you want to use a reasonably large block size. The Amazon-suggested 1 MB block size seems good to me.
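
For example, adapting the command line from the question to a 1 MB block size (the iodepth of 32 mirrors the AWS example mentioned in the comments; with libaio and direct=1 it keeps multiple requests in flight):

    fio --name=prewarm --filename=<filename> --rw=read --direct=1 --ioengine=libaio --bs=1M --iodepth=32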

I suspect dd would be as fast as, or faster than, fio for this particular workload. However, you should simply experiment and use whichever method reads (and re-hydrates) the volume faster.
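
For a single file, a minimal `dd` equivalent would be something like the following (reading through O_DIRECT and discarding the data):

    dd if=<filename> of=/dev/null bs=1M iflag=direct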

Finally, consider that stat returns two I/O size values:

  • minimum, the smallest IO size the device will read/write;
  • optimal, the smallest IO size that achieves good performance by avoiding read/modify/write (r/m/w) behavior.

This does not mean that IO larger than the optimal size will be slower; rather, larger sizes can actually slightly increase IO performance.
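
To inspect these values, GNU `stat` exposes the optimal I/O transfer size hint as `%o`, and `lsblk -t` (see the comments below) reports the MIN-IO/OPT-IO of the underlying block device - the device name here is just an example:

    stat -c '%o' <filename>   # optimal I/O transfer size hint for a file
    lsblk -t /dev/xvdf        # MIN-IO and OPT-IO columns for the device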

shodanshok
  • 1
    The documentation linked to says fio is faster as it's multi-threaded. It also suggests 128k as the block size, though it doesn't say why. I agree moving to 1MB is probably a good plan. I'd probably use fio and 1MB. – Tim Sep 24 '19 at 21:15
  • 1
    @Tim The documentation uses `fio` with 128K blocksize *and* iodepth at 32, which means a ~4MB data chunk is in flight in each moment. Anyway, due to its sheer simplicity, I sometime see `dd` to be slightly faster than `fio` for raw data read/write. As always - your mileage may vary; the OP should try by himself. – shodanshok Sep 25 '19 at 06:00
  • 1
    If the block size is 1MByte or larger there's a [chance that the kernel will split it](https://unix.stackexchange.com/a/533845/109111) creating a bit more device parallelism at the cost of more kernel overhead (you would need to check the size of I/Os actually being submitted down to disk to know for sure). A tiny block size on fio won't help your throughput on big files (as you just force more work to be done per I/O) but if the file is tiny you'll have no choice but as @shodanshok stated its key to set the iodepth too (fio is multithreaded but in this case its just being asynchronous ;-). – Anon Sep 28 '19 at 07:47
  • @shodanshok does stat really report sizes that avoid all r/m/w? I thought the values it gives back only refer to the block level, not necessarily to the meta level (think software RAID stripe size), but perhaps that's exactly what you meant ;-) – Anon Sep 28 '19 at 07:54
  • @Anon `stat`'s optimal block size is the IO size at which the storage layer is expected to work optimally - mainly by avoiding r/m/w. Anyway, for the optimal block size to report realistic values, *any* storage layer has to pass up the underlying information about chunk/stripe size (and Linux software RAID does exactly that, by the way). – shodanshok Sep 28 '19 at 20:30
  • @shodanshok Aren't you referring to [`stat(1)`](https://linux.die.net/man/1/stat) in your main answer? Now that I think about it, `stat(1)`'s life is more complicated because it can end up reflecting what the filesystem says it needs, which need not be what the device has said. e.g. if I do `modprobe scsi_debug opt_blks=256 physblk_exp=4 && mke2fs /dev/$scsidebugdev && mkdir /tmp/scsi_debug; mount /dev/$scsidebugdev /tmp/scsi_debug && grep . /sys/block/$scsidebugdev/queue/{logical_block_size,physical_block_size,optimal_io_size}; stat -c "%S %s %o" /dev/$scsidebugdev` the numbers don't match. – Anon Sep 29 '19 at 06:59
  • ...and then doing `stat -c "%S %s %o" /tmp/scsi_debug/` will give us different results (`? 1024 1024`) again! While the filesystem part is a massive aside, I don't believe `stat` is the right command when you're trying to find block size/transfer information for a block device (as opposed to when you're atop a filesystem). – Anon Sep 29 '19 at 07:09
  • @Anon for a block device you should use `lsblk -t`. Regarding `stat`, the value returned *should* be the correct one, but no strong guarantee exists. – shodanshok Sep 29 '19 at 15:03