I have a setup with two nodes that use DRBD to sync KVM VPSes for failover, so each VPS is only active on one node. The active node currently runs 4 KVM VPSes.
The two nodes have a dedicated 10G interface for the DRBD sync, so that shouldn't be an I/O bottleneck.
Sysbench measures disk I/O throughput at about 400 MB/s.
The problem is that at random intervals one of the VPSes starts to peak in I/O at about 400 MB/s (the same disk I/O limit measured above) and becomes unresponsive, while the other VPSes stay responsive. I'm unable to find what is causing the high I/O at that moment, because the server won't accept SSH logins while it's happening. I do monitor the VPSes with Telegraf -> InfluxDB, and there I can see the I/O going high, but I'm not sure how to use that to find which application/user is causing the load, or why only this one VPS is affected while they all use the same underlying DRBD disks.
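One thing I'm considering is running a small logger inside the affected guest that samples the per-process I/O counters from /proc and appends the top writers to a local file, so I can see after the fact which process was hammering the disk even though I couldn't log in at the time. A minimal sketch of what I have in mind (the log path, the 5-second interval, and the top-5 cutoff are my own picks, not anything I've tested in this setup):

```python
#!/usr/bin/env python3
# Sketch: periodically log the processes with the highest disk I/O deltas,
# read from /proc/<pid>/io (needs root to see other users' processes).
import os
import time

INTERVAL = 5                            # seconds between samples (assumption)
LOGFILE = "/var/log/io-offenders.log"   # assumption, any local path works
TOP_N = 5

def snapshot():
    """Return {pid: (comm, read_bytes, write_bytes)} for all readable processes."""
    procs = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as f:
                fields = dict(line.split(": ") for line in f.read().splitlines())
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
            procs[pid] = (comm, int(fields["read_bytes"]), int(fields["write_bytes"]))
        except (OSError, KeyError, ValueError):
            continue  # process exited between listing and reading, or not readable
    return procs

prev = snapshot()
while True:
    time.sleep(INTERVAL)
    cur = snapshot()
    deltas = []
    for pid, (comm, r, w) in cur.items():
        if pid in prev:
            _, pr, pw = prev[pid]
            deltas.append(((r - pr) + (w - pw), pid, comm, r - pr, w - pw))
    deltas.sort(reverse=True)  # biggest combined read+write delta first
    with open(LOGFILE, "a") as log:
        log.write(time.strftime("%Y-%m-%d %H:%M:%S\n"))
        for total, pid, comm, dr, dw in deltas[:TOP_N]:
            log.write(f"  pid={pid} {comm}: read {dr // 1024} KiB, "
                      f"write {dw // 1024} KiB in {INTERVAL}s\n")
    prev = cur
```

Tools like iotop or pidstat -d show the same counters interactively, but that doesn't help when the VPS is too unresponsive to log in, which is why I'd want something logging to disk continuously (e.g. started from systemd or cron) before the next spike.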
Any suggestions on how to debug this?