First, try looking at your sar logs for resource usage around the time this error occurred:
CPU: sar -u
Memory: sar -r
- Check the %memused column, but more importantly check %commit.
Load: sar -q
- Check for a load number above the number of CPUs you have (cat /proc/cpuinfo | grep proc).
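The load check above can be scripted. This is a minimal sketch, assuming a Linux host with /proc/cpuinfo and /proc/loadavg; the helper name load_exceeds_cpus is made up for illustration, and it looks at the current 1-minute load rather than the historical values that sar -q reports.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when load average $1 exceeds CPU count $2.
load_exceeds_cpus() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { exit !(load > cpus) }'
}

# Only run the live check where the /proc files exist (i.e. on Linux).
if [ -r /proc/cpuinfo ] && [ -r /proc/loadavg ]; then
    cpus=$(grep -c '^processor' /proc/cpuinfo)
    load=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
    if load_exceeds_cpus "$load" "$cpus"; then
        echo "load $load is above the $cpus CPUs available"
    else
        echo "load $load is within the $cpus CPUs available"
    fi
fi
```

For the time of the error itself you still need the sar history, since this only samples the present moment.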
Secondly, and most importantly, this error occurred because a task stayed blocked for more than 120 seconds while waiting for outstanding data to be flushed to disk. Linux lets dirty (modified but not yet written) cached file data fill memory up to a limit set by vm.dirty_ratio. That limit is commonly cited as 40%, though the default varies by kernel version and distribution (check yours with sysctl vm.dirty_ratio). Below the limit, the cache is written out asynchronously (a non-blocking background operation that lets the process continue). Once dirty data passes the limit, writes become synchronous: the process blocks and waits until the I/O is committed to disk. If the I/O subsystem cannot keep up and fails to flush the data within 120 seconds, this error will occur.
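To see how close a system is to these limits, you can compare the kernel's dirty-page counter with the configured ratios. The sketch below assumes a Linux /proc layout. dirty_limit_kb is a hypothetical helper that approximates each limit as a percentage of total memory, whereas the kernel computes it from available (dirtyable) memory, so treat the figures as rough upper bounds. The 120-second figure is assumed here to correspond to kernel.hung_task_timeout_secs.

```shell
#!/bin/sh
# Hypothetical helper: approximate dirty-page limit in kB.
# $1 = total memory in kB, $2 = ratio in percent.
dirty_limit_kb() {
    echo $(( $1 * $2 / 100 ))
}

# Only read the live counters where the /proc files exist (i.e. on Linux).
if [ -r /proc/meminfo ] && [ -r /proc/sys/vm/dirty_ratio ] \
    && [ -r /proc/sys/vm/dirty_background_ratio ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    dirty_kb=$(awk '/^Dirty:/ { print $2 }' /proc/meminfo)
    ratio=$(cat /proc/sys/vm/dirty_ratio)
    bg_ratio=$(cat /proc/sys/vm/dirty_background_ratio)
    echo "dirty pages now:        ${dirty_kb} kB"
    echo "background flush near:  $(dirty_limit_kb "$total_kb" "$bg_ratio") kB (${bg_ratio}%)"
    echo "writers block near:     $(dirty_limit_kb "$total_kb" "$ratio") kB (${ratio}%)"
fi

# This file only exists on kernels built with hung-task detection.
if [ -r /proc/sys/kernel/hung_task_timeout_secs ]; then
    echo "hung task timeout:      $(cat /proc/sys/kernel/hung_task_timeout_secs) s"
fi
```

If the "dirty pages now" figure regularly sits near the blocking limit during heavy writes, the disk is probably not keeping up.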
One popular solution is to force the system to flush sooner.
You can add the following to /etc/sysctl.conf:
vm.dirty_ratio=10
(absolute maximum percentage of system memory, 10% in this case, that can be filled with dirty pages before writing processes are blocked until the data is flushed to disk)
vm.dirty_background_ratio=5
(percentage of system memory, 5% in this case, that can be filled with dirty pages before the kernel starts flushing them in the background)
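After editing the file it is worth confirming what it actually sets. This is a small sketch, not part of any standard tool: conf_value is a hypothetical helper that reads a key from a sysctl.conf-style file, ignoring comment lines and letting the last assignment win.

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to a key in a sysctl.conf-style file.
# $1 = file, $2 = key (e.g. vm.dirty_ratio). Prints nothing if the key is absent.
conf_value() {
    awk -F '=' -v key="$2" '
        /^[[:space:]]*[#;]/ { next }          # skip comment lines
        {
            k = $1; v = $2
            gsub(/[[:space:]]/, "", k)        # tolerate "key = value" spacing
            gsub(/[[:space:]]/, "", v)
            if (k == key) value = v           # last assignment wins
        }
        END { if (value != "") print value }
    ' "$1"
}

if [ -r /etc/sysctl.conf ]; then
    for key in vm.dirty_ratio vm.dirty_background_ratio; do
        echo "$key in /etc/sysctl.conf: $(conf_value /etc/sysctl.conf "$key")"
    done
fi

# To load the file without rebooting, run as root:  sysctl -p
# To read the live values:  sysctl vm.dirty_ratio vm.dirty_background_ratio
```

The background ratio should stay below the hard ratio, as it does with the 5 and 10 used here, so that background flushing starts before any process is forced to wait.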
I hope this helps you out!