5

My research suggests that both the default and the maximum (a kernel limitation) block size for modern filesystems (ext4, XFS) is 4 kB. However, AWS allows I/O operations as large as 256 kB, and says:

For 32 KB or smaller I/O operations, you should see the amount of IOPS that you have provisioned, provided that you are driving enough I/O to keep the drives busy. For smaller I/O operations, you may even see an IOPS value that is higher than what you have provisioned (when measured on the client side), and this is because the client may be coalescing multiple smaller I/O operations into a smaller number of large chunks.
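As a rough sanity check of the quoted behavior (my own arithmetic, assuming sequential client I/Os coalesce perfectly into large chunks):

```python
import math

def device_ops(client_ops, client_io_bytes, chunk_bytes=256 * 1024):
    """Device-side operations after sequential client I/Os coalesce into chunks."""
    return math.ceil(client_ops * client_io_bytes / chunk_bytes)

# 64 sequential 4 kB writes can coalesce into a single 256 kB device operation,
# which is why client-side IOPS can look far higher than device-side IOPS.
print(device_ops(64, 4096))   # 1
```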

Where does Linux expose and/or allow configuration of the "device block size"? When doing, say, a full table scan in PostgreSQL (8 kB block size), where can you see and/or configure the size of the I/O operations the OS issues?
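For what it's worth, one place to look (a sketch, assuming a Linux system with sysfs mounted) is the per-device queue limits under `/sys/block`, which is where the kernel exposes logical/physical block sizes and request-size limits:

```python
import glob
import os

def queue_limits():
    """Read per-device I/O parameters the kernel exposes under /sys/block."""
    limits = {}
    for dev in glob.glob("/sys/block/*"):
        entry = {}
        for name in ("logical_block_size", "physical_block_size",
                     "max_sectors_kb", "read_ahead_kb"):
            path = os.path.join(dev, "queue", name)
            try:
                with open(path) as f:
                    entry[name] = int(f.read())
            except OSError:
                pass  # attribute not present for this device
        limits[os.path.basename(dev)] = entry
    return limits

print(queue_limits())
```

`max_sectors_kb` caps the size of a single request the block layer will issue, and `read_ahead_kb` is tunable at runtime; PostgreSQL's own block size is fixed at compile time (`SHOW block_size;`).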

Craig Ringer
user4258
  • It depends which block size you are speaking about: the filesystem block size? The disk block size? The page size? It's a bit unclear in your question. – Xavier Lucas Nov 16 '14 at 23:13
  • You also have to factor in `fsync`s for writes in PostgreSQL; these also count as I/O operations and may limit write coalescing on the client side too. They're essential for crash-safety, though, so there's a reason PostgreSQL does them. – Craig Ringer Nov 17 '14 at 04:20

1 Answer

4

The size of one I/O operation depends on a lot of things; calculating the average for your application is not necessarily a bad idea.

Amazon's definition implies that their hardware supports 256 kB blocks. A single I/O operation is the read or write of one block. An "unaligned" access, where an operation spans two hardware blocks, results in two I/O operations even if the software and hardware block sizes match. This is why it is so useful for I/O performance to use a filesystem block size that matches the hardware block size, though this can reduce storage efficiency, since a filesystem block is the allocation quantum.
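To make the alignment point concrete, here is a small illustration (my own arithmetic, not from the answer) counting how many hardware blocks, and thus I/O operations, a single request touches:

```python
def io_ops(offset, length, block_size=256 * 1024):
    """Number of hardware blocks (= I/O operations) touched by a request."""
    if length <= 0:
        return 0
    first = offset // block_size                 # first block covered
    last = (offset + length - 1) // block_size   # last block covered
    return last - first + 1

# An aligned 256 kB read is a single operation...
print(io_ops(0, 256 * 1024))      # 1
# ...but shift it by 4 kB and it straddles two hardware blocks.
print(io_ops(4096, 256 * 1024))   # 2
```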

Block sizes in filesystems are largely dictated by the memory page size, since reads go into memory pages. Memory pages are normally 4 kB on x86 Linux; although the kernel can map larger pages, there is no intermediate size between a normal 4 kB page and a huge page (2 MB on x86-64, 4 MB on 32-bit x86). So you can't really tune this on modern systems and hardware.
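The page size mentioned above can be read at runtime; a minimal check:

```python
import mmap
import os

# The kernel's base page size, as seen from user space.
page = mmap.PAGESIZE
assert page == os.sysconf("SC_PAGE_SIZE")
print(page)  # typically 4096 on x86 Linux
```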

However, the filesystem can try to keep multi-block reads and writes sequential by preventing fragmentation. ext4 does this with delayed allocation and extents, placing a file's blocks contiguously on disk instead of simply grabbing the next free block as each one is requested; other filesystems have similar strategies. The kernel (filesystem and disk drivers) can then merge several requests for physically consecutive blocks into a single larger operation, as long as it does not cross a physical block boundary.
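A toy version of that merging (my own illustration, not kernel code): sort the pending requests and merge any that are physically adjacent:

```python
def merge_adjacent(requests):
    """Merge (start_block, n_blocks) requests that are physically consecutive."""
    merged = []
    for start, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1][1] += count          # extend the previous request
        else:
            merged.append([start, count])
    return [tuple(r) for r in merged]

# Three consecutive single-block reads collapse into one three-block read.
print(merge_adjacent([(0, 1), (1, 1), (2, 1), (8, 1)]))  # [(0, 3), (8, 1)]
```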

The disk driver discovers the disk block size automatically; you cannot adjust it. You can read it with `blockdev --getpbsz /dev/xvda` (or whatever your device is).

Falcon Momot
  • According to [this Q&A](http://unix.stackexchange.com/questions/145241/how-to-set-block-size-using-blockdev-command) the blocksize could be adjusted. – 030 Nov 16 '14 at 23:38
  • Check out the manpage http://man7.org/linux/man-pages/man8/blockdev.8.html. Also, the physical blocksize is what matters here. – Falcon Momot Nov 16 '14 at 23:42
  • Ok. Agreed. `blockdev --getpbsz /dev/sda1` returns `512`. Could you explain this? `blockdev --report` shows the output of all drives. The BSZ of three drives is `4096` while the two others are `512`. The expectation was `4096 B`, i.e. 4kB. – 030 Nov 16 '14 at 23:54
  • According to this: https://wiki.archlinux.org/index.php/Advanced_Format "The logical sector size is the sector size used for data transfer", which in your case directly affects your IOP count. Tests I ran on my physical disk with 4k physical / 512 byte logical confirmed the statement. There is no way to alter this parameter as it is advertised by the disk as-is. – Matthew Ife Nov 16 '14 at 23:55
  • @utrecht It can't be adjusted; the block size in the case of a hard drive means the sector size, and that is decided by the manufacturer. However, you can change the block size for [special files in block mode created on your machine](http://linux.die.net/man/1/mknod), but that's all. Old HDDs use a 512-byte sector size while new ones use 4096 bytes. – Xavier Lucas Nov 17 '14 at 00:00
  • I suspect that the Xen paravirt driver could be taught to coalesce sequential 4k reads/writes into bigger operations - at the expense of read/write latency. It'd also make sense to always readahead if you can do so "for free" - if the client asks for 4k, request 32k. At worst you just throw it away, at best you saved 8 reads. This would probably require tuning on the Xen *host* (dom0) though, not the guest (domU), and you don't control the host. – Craig Ringer Nov 17 '14 at 04:25