Note the default SCSI timeout is 30 seconds. That is already a fairly long time in computer terms :-P.
IO requests (e.g. async writes) are bounded by `/sys/class/block/$DEV/queue/nr_requests` and `/sys/class/block/$DEV/queue/max_sectors_kb`. In the old single-queue block layer, total memory usage is said to be `2 * nr_requests * max_sectors_kb`. The factor of 2 is because reads and writes are counted separately. You also need to account for requests in the hardware queue, see e.g. `cat /sys/class/block/sda/device/queue_depth`. You are generally expected to make sure the maximum hardware queue depth is no larger than half of `nr_requests`.
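As a rough sketch, you can read those values on your own system and work out the estimate. Here `sda` is just an example device name; substitute your own:

```
DEV=sda   # example device; substitute your own

NR=$(cat /sys/class/block/$DEV/queue/nr_requests)
MAX=$(cat /sys/class/block/$DEV/queue/max_sectors_kb)
echo "nr_requests=$NR max_sectors_kb=$MAX"

# Hardware queue depth (SCSI devices only):
cat /sys/class/block/$DEV/device/queue_depth

# Rough upper bound for the old single-queue block layer, in KiB:
echo "estimated bound: $(( 2 * NR * MAX )) KiB"
```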
1) It is written that if your IO requests need too much space, you will get out-of-memory errors. So you could have a look at the above values on your specific system. Usually they are not a problem. `nr_requests` defaults to 128. The default value of `max_sectors_kb` depends on your kernel version.
If you use the new multi-queue block layer (blk-mq), reads and writes are not counted separately. So the "multiply by two" part of the equation goes away, and `nr_requests` defaults to 256 instead. I am not certain how the hardware queue (or queues) is treated in blk-mq.
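If you are not sure which block layer a device is using, the scheduler names are a hint (this assumes the usual scheduler sets; on kernels from around 5.0 onwards the legacy single-queue layer is gone and everything is blk-mq anyway):

```
DEV=sda   # example device; substitute your own

# blk-mq offers "none", "mq-deadline", "kyber" or "bfq";
# the legacy single-queue layer offered "noop", "deadline", "cfq".
cat /sys/class/block/$DEV/queue/scheduler
cat /sys/class/block/$DEV/queue/nr_requests
```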
When the request queue is full, async writes can build up in the page cache until they hit the "dirty limit". Historically the default dirty limit is described as 20% of RAM, although the exact determination is slightly more complex nowadays.
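To see how the dirty limit is configured on your system, look at the standard `vm` sysctls (the byte-based variants override the ratio-based ones when set), and at the current amount of dirty page cache:

```
sysctl vm.dirty_ratio vm.dirty_background_ratio
sysctl vm.dirty_bytes vm.dirty_background_bytes

# Pages currently dirty or under writeback:
grep -E '^(Dirty|Writeback):' /proc/meminfo
```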
When you hit the dirty limit, you just have to wait. The kernel does not have another hard timeout beyond the SCSI timeout. In that sense, the common documents on this topic, including the VMware KB, are quite sufficient. Although you should search for the specific documentation that applies to your NAS :-P. Different vintages of NAS have been engineered to provide different worst-case timings.
2) That said, if a process has been waiting for disk IO for more than 120 seconds, the kernel will print a "hung task" warning. (Probably. That's the usual default. Except on my version of Fedora Linux, where the kernel seems to have been built without `CONFIG_DETECT_HUNG_TASK`. Fedora appears to be a weird outlier here.)
The hung task message is not a crash, and it does not set the kernel "tainted" flag.
After 10 hung task warnings (or whatever you set `kernel.hung_task_warnings` to), the kernel stops printing them. Thinking about this, in my opinion you should also increase the sysctl `kernel.hung_task_timeout_secs`, so that it is above your SCSI timeout, e.g. 480 seconds.
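A minimal sketch of how that might look (480 is just the example value from above, and the drop-in file name is hypothetical; run as root):

```
# Current per-device SCSI command timeout, in seconds (default 30):
cat /sys/class/block/sda/device/timeout

# Raise the hung task timeout above the SCSI timeout for the current boot:
sysctl -w kernel.hung_task_timeout_secs=480

# Persist it across reboots (assuming your distro reads /etc/sysctl.d;
# the file name here is just an example):
echo 'kernel.hung_task_timeout_secs = 480' > /etc/sysctl.d/90-hung-task.conf
```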
3) Individual applications may have their own timeouts. You probably prefer to see an application timeout, rather than have the kernel return an IO error! Filesystem IO errors are commonly considered fatal. The filesystem itself may remount read-only after an IO error, depending on configuration. IO errors in swap devices or memory-mapped files will send the SIGBUS signal to the affected process, which will usually terminate the process.
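For example, on ext4 the behaviour after an IO error is a per-filesystem setting; you can check it roughly like this (assuming an ext4 filesystem on a hypothetical /dev/sda1):

```
# "Continue", "Remount read-only" or "Panic"; can also be overridden
# at mount time with the errors= mount option.
tune2fs -l /dev/sda1 | grep -i 'errors behavior'
```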
4) If using `systemd`, services which have a watchdog timer configured could be forcibly restarted. In current versions of `systemd`, you can see e.g. a timeout of 3 minutes if you run `systemctl show -p WatchdogUSec systemd-udevd`. This was increased four years ago for a different reason; it appears to be just a co-incidence that this matches VMware's suggested SCSI timeout :-). These restarts could generate alarming log noise. `systemd` kills the process with SIGABRT, with the idea of getting a core dump to show where the process got stuck. However, stuff like udev and even journald is supposed to be quite happy to be restarted nowadays.
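If you want a rough idea of which of your services have a watchdog configured, a small sketch like this loops over the running service units (it assumes a `systemctl` new enough to support `--value`):

```
systemctl list-units --type=service --state=running --no-legend \
  | awk '{print $1}' \
  | while read -r unit; do
      usec=$(systemctl show -p WatchdogUSec --value "$unit")
      # WatchdogUSec is 0 when no watchdog is configured.
      [ -n "$usec" ] && [ "$usec" != "0" ] && echo "$unit $usec"
    done
```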
The main concern would be to make sure that you have not configured a too-short userspace reboot watchdog, e.g. `RuntimeWatchdogSec=` in `/etc/systemd/system.conf`. Even if you do not use swap, it would be possible for `systemd` to become blocked by disk IO, by a memory allocation that enters kernel "direct reclaim".
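You can check the current setting roughly like this (the property query at the end assumes a reasonably recent `systemd` that exposes the value on the manager):

```
# Look for an explicit setting in the manager configuration (unset means disabled):
grep -i '^RuntimeWatchdogSec' /etc/systemd/system.conf /etc/systemd/system.conf.d/*.conf 2>/dev/null

# Or ask the running manager directly:
systemctl show -p RuntimeWatchdogUSec
```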