Note the default SCSI timeout is 30 seconds. That is already a fairly long time in computer terms :-P.
IO requests (e.g. async writes) are bounded by `/sys/class/block/$DEV/queue/nr_requests` and `/sys/class/block/$DEV/queue/max_sectors_kb`. In the old single-queue block layer, total memory usage is said to be `2 * nr_requests * max_sectors_kb`. The factor of 2 is because reads and writes are counted separately. You also need to account for requests in the hardware queue, see e.g. `cat /sys/class/block/sda/device/queue_depth`. You are generally expected to make sure the maximum hardware queue depth is no larger than half of `nr_requests`.
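As a rough sketch, you can read those values on your own system and work out the estimate. Here `sda` is just an example device name; substitute your own:

```
DEV=sda   # example device; substitute your own

NR=$(cat /sys/class/block/$DEV/queue/nr_requests)
MAX=$(cat /sys/class/block/$DEV/queue/max_sectors_kb)
echo "nr_requests=$NR max_sectors_kb=$MAX"

# Hardware queue depth (SCSI devices only):
cat /sys/class/block/$DEV/device/queue_depth

# Rough upper bound for the old single-queue block layer, in KiB:
echo "estimated bound: $(( 2 * NR * MAX )) KiB"
```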
1) It is written that if your IO requests need too much space, you will get out-of-memory errors. So you could have a look at the above values on your specific system. Usually they are not a problem. `nr_requests` defaults to 128. The default value of `max_sectors_kb` depends on your kernel version.
If you use the new multi-queue block layer (blk-mq), reads and writes are not counted separately. So the "multiply by two" part of the equation goes away, and `nr_requests` defaults to 256 instead. I am not certain how the hardware queue (or queues) is treated in blk-mq.
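If you are not sure which block layer a device is using, the scheduler names are a hint (this assumes the usual scheduler sets; on kernels from around 5.0 onwards the legacy single-queue layer is gone and everything is blk-mq anyway):

```
DEV=sda   # example device; substitute your own

# blk-mq offers "none", "mq-deadline", "kyber" or "bfq";
# the legacy single-queue layer offered "noop", "deadline", "cfq".
cat /sys/class/block/$DEV/queue/scheduler
cat /sys/class/block/$DEV/queue/nr_requests
```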
When the request queue is full, async writes can build up in the page cache until they hit the "dirty limit". Historically the default dirty limit is described as 20% of RAM, although the exact determination is slightly more complex nowadays.
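To see how the dirty limit is configured on your system, look at the standard `vm` sysctls (the byte-based variants override the ratio-based ones when set), and at the current amount of dirty page cache:

```
sysctl vm.dirty_ratio vm.dirty_background_ratio
sysctl vm.dirty_bytes vm.dirty_background_bytes

# Pages currently dirty or under writeback:
grep -E '^(Dirty|Writeback):' /proc/meminfo
```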
When you hit the dirty limit, you just have to wait. The kernel does not have another hard timeout beyond the SCSI timeout. In that sense, the common documents on this topic, including the VMware KB, are quite sufficient. Although you should search for the specific documentation that applies to your NAS :-P. Different vintages of NAS have been engineered to provide different worst-case timings.
2) That said, if a process has been waiting for disk IO for more than 120 seconds, the kernel will print a "hung task" warning. (Probably. That's the usual default. Except on my version of Fedora Linux, where the kernel seems to have been built without `CONFIG_DETECT_HUNG_TASK`. Fedora appears to be a weird outlier here.)
The hung task message is not a crash, and it does not set the kernel "tainted" flag.
After 10 hung task warnings (or whatever you set `kernel.hung_task_warnings` to), the kernel stops printing them. Thinking about this, in my opinion you should also increase the sysctl `kernel.hung_task_timeout_secs`, so that it is above your SCSI timeout, e.g. 480 seconds.
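A minimal sketch of how that might look (480 is just the example value from above, and the drop-in file name is hypothetical; run as root):

```
# Current per-device SCSI command timeout, in seconds (default 30):
cat /sys/class/block/sda/device/timeout

# Raise the hung task timeout above the SCSI timeout for the current boot:
sysctl -w kernel.hung_task_timeout_secs=480

# Persist it across reboots (assuming your distro reads /etc/sysctl.d;
# the file name here is just an example):
echo 'kernel.hung_task_timeout_secs = 480' > /etc/sysctl.d/90-hung-task.conf
```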
3) Individual applications may have their own timeouts. You probably prefer to see an application timeout, rather than have the kernel return an IO error! Filesystem IO errors are commonly considered fatal. The filesystem itself may remount read-only after an IO error, depending on configuration. IO errors in swap devices or memory-mapped files will send the SIGBUS signal to the affected process, which will usually terminate the process.
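For example, on ext4 the behaviour after an IO error is a per-filesystem setting; you can check it roughly like this (assuming an ext4 filesystem on a hypothetical /dev/sda1):

```
# "Continue", "Remount read-only" or "Panic"; can also be overridden
# at mount time with the errors= mount option.
tune2fs -l /dev/sda1 | grep -i 'errors behavior'
```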
4) If using `systemd`, services which have a watchdog timer configured could be forcibly restarted. In current versions of `systemd`, you can see e.g. a timeout of 3 minutes if you run `systemctl show -p WatchdogUSec systemd-udevd`. This was increased four years ago for a different reason; it appears to be just a co-incidence that this matches VMware's suggested SCSI timeout :-). These restarts could generate alarming log noise. `systemd` kills the process with SIGABRT, with the idea of getting a core dump to show where the process got stuck. However, stuff like udev and even journald is supposed to be quite happy to be restarted nowadays.
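If you want a rough idea of which of your services have a watchdog configured, a small sketch like this loops over the running service units (it assumes a `systemctl` new enough to support `--value`):

```
systemctl list-units --type=service --state=running --no-legend \
  | awk '{print $1}' \
  | while read -r unit; do
      usec=$(systemctl show -p WatchdogUSec --value "$unit")
      # WatchdogUSec is 0 when no watchdog is configured.
      [ -n "$usec" ] && [ "$usec" != "0" ] && echo "$unit $usec"
    done
```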
The main concern would be to make sure that you have not configured a too-short userspace reboot watchdog, e.g. `RuntimeWatchdogSec=` in `/etc/systemd/system.conf`. Even if you do not use swap, it would be possible for `systemd` to become blocked by disk IO, by a memory allocation that enters kernel "direct reclaim".
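You can check the current setting roughly like this (the property query at the end assumes a reasonably recent `systemd` that exposes the value on the manager):

```
# Look for an explicit setting in the manager configuration (unset means disabled):
grep -i '^RuntimeWatchdogSec' /etc/systemd/system.conf /etc/systemd/system.conf.d/*.conf 2>/dev/null

# Or ask the running manager directly:
systemctl show -p RuntimeWatchdogUSec
```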