We are seeing instances of a VM hanging with the following symptoms:
- load average 800, processes stuck, CPU 100% in iowait
- reading files works, but writing files hangs the system
- RAM utilization is high, but that is expected when the system is working properly
- /var/log/messages shows no kernel crash and no OOM kill, but it does contain kernel warnings about tasks blocked for more than 120 seconds, each with a stack trace related to the storage path
- the hypervisor shows the VM almost idle in terms of CPU utilization
- rebooting the system is the only way to make it work again
- the stack traces in dmesg mention kernel tasks hung for more than 120 seconds in io_write / sync syscalls
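
On Linux the load average also counts tasks in uninterruptible sleep (state D), so a load of 800 with the CPU in iowait would be consistent with hundreds of tasks blocked on I/O rather than competing for CPU. A minimal sketch for listing those tasks from inside the VM, assuming a Python interpreter is available there and that a shell still responds during the hang; the function names are ours, not from any tool:

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for tasks in uninterruptible sleep (state 'D')."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (IOError, OSError, ValueError):
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print("%d %s" % (pid, comm))
```

If the number of D-state tasks is close to the load average, that would support the idea that the hang is in the write path and not in the application.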
The hypervisor is Oracle Enterprise Linux 7.2 and the VM is CentOS 6.6, running a JBoss appliance. The block device is of type virtio. The qcow drive is hosted locally on the hypervisor, on an SSD. We suspect something is wrong in the filesystem -> block device -> virtio layer.
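
One way to narrow down which layer is stuck might be to sample /proc/diskstats twice inside the guest and check whether requests are in flight while no writes complete. The sketch below is a rough heuristic, not a definitive test. It assumes the virtio disks appear with a `vd` name prefix (for example `vda`), and the function names and the 10-second interval are arbitrary choices for illustration:

```python
import time


def parse_diskstats(text, prefix="vd"):
    """Map device name -> (writes_completed, ios_in_progress)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Layout: major minor name, then the per-device counters.
        # fields[7] is writes completed, fields[11] is I/Os in progress.
        if len(fields) < 14 or not fields[2].startswith(prefix):
            continue
        stats[fields[2]] = (int(fields[7]), int(fields[11]))
    return stats


def stalled_devices(before, after):
    """Devices with I/O in flight but no write completed between samples."""
    stalled = []
    for name, (writes_after, inflight_after) in sorted(after.items()):
        if name not in before:
            continue
        writes_before, _ = before[name]
        if inflight_after > 0 and writes_after == writes_before:
            stalled.append(name)
    return stalled


if __name__ == "__main__":
    with open("/proc/diskstats") as fh:
        first = parse_diskstats(fh.read())
    time.sleep(10)  # assumed sampling interval
    with open("/proc/diskstats") as fh:
        second = parse_diskstats(fh.read())
    print(stalled_devices(first, second))
```

A device reported here has requests queued in the guest that are not completing, which would point below the guest filesystem, toward the virtio or host side.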