We have a two-host environment with a fiber NIC team (NIC Team) and a copper NIC team (NIC Team2). The hosts are clustered on Windows Server 2012 R2 Standard (fully patched) with Hyper-V, failover clustering, and storage pools. The VMs are around 50 Debian machines distributed evenly across the hosts. The networks are three subnets: Cluster, Switch 0, and Switch 1. Two carry cluster and client traffic; one is cluster-only.
Every once in a while, the entire environment crashes. The most noticeable signs are that CPU on the VMs jumps to 100% and network access to both physical and virtual machines becomes unusable. The only way to recover is a hard shutdown of both hosts, after which everything returns to normal.
Here is what I think I know from crawling through the logs and reviewing our aggregate logging and performance data (note: not every message applies to each incident; this is an aggregate). A sketch of the query I use to pull these events follows the list.
Windows:
-TCP local endpoint same as remote, local ports being reused (TCP ports run out) - Event ID 4227
-Ephemeral port exhaustion - Event ID 4231
-Cluster Shared Volume is paused and I/O access is redirected over the network - Event ID 5121
Linux:
-High CPU in top from ksoftirqd
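For the Windows side, something like the following pulls those events from both hosts into CSVs that can be lined up against the crash windows (HOST1/HOST2 are placeholders, the 7-day window is arbitrary, and it assumes all three event IDs land in the System log, 4227/4231 from Tcpip and 5121 from FailoverClustering):

    # HOST1/HOST2 are placeholders for the two cluster nodes
    $nodes = 'HOST1', 'HOST2'
    foreach ($node in $nodes) {
        # 4227/4231 come from Tcpip, 5121 from FailoverClustering; all normally land in the System log
        Get-WinEvent -ComputerName $node -FilterHashtable @{
            LogName   = 'System'
            Id        = 4227, 4231, 5121
            StartTime = (Get-Date).AddDays(-7)
        } |
            Select-Object MachineName, TimeCreated, Id, ProviderName, Message |
            Export-Csv "C:\temp\$($node)-events.csv" -NoTypeInformation
    }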
My interpretation: there is a leak on either the host or VM side that eats all the TCP ports and causes the VMQs to back up. This creates a backlog across the environment that eventually causes a crash.
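To test that interpretation before the next crash, I am planning to watch both halves of it on each host: ephemeral port consumption (which would explain the 4227/4231 events) and the VMQ allocation across the teamed NICs. A rough sketch; the grouping by owning process should point at whichever process (or VM worker process) is hoarding connections:

    # Current dynamic (ephemeral) port range
    netsh int ipv4 show dynamicport tcp

    # Connections per state, and which processes own the most
    # (a leak shows up as one process with a very large group;
    #  if OwningProcess is not populated on 2012 R2, netstat -ano gives the same mapping)
    Get-NetTCPConnection | Group-Object State | Sort-Object Count -Descending
    Get-NetTCPConnection | Group-Object OwningProcess | Sort-Object Count -Descending |
        Select-Object Count, Name -First 10

    # VMQ state: which physical NICs have it enabled and how the queues are allocated
    Get-NetAdapterVmq
    Get-NetAdapterVmqQueue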
My question: how do I determine exactly what is causing this? And are there ways to mitigate it without knowing the specifics?
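For context, the stop-gaps I am considering (assumptions on my part, not something the logs confirm) are widening the dynamic port range so the 4227/4231 errors take longer to hit, and temporarily disabling VMQ on the teamed physical NICs as a test, since VMQ problems on 2012 R2 with some NIC drivers are a known source of this kind of stall. Roughly, with placeholder adapter names:

    # Widen the dynamic port range (the 2012 R2 default is 49152-65535, about 16k ports)
    netsh int ipv4 set dynamicport tcp start=10000 num=55535

    # Temporarily disable VMQ on the teamed physical NICs as a test
    # 'NIC1','NIC2' are placeholders for the team members (see Get-NetAdapter)
    Disable-NetAdapterVmq -Name 'NIC1', 'NIC2'

    # Re-enable once the test is over
    # Enable-NetAdapterVmq -Name 'NIC1', 'NIC2'

Both changes are reversible, so if the crashes stop with VMQ disabled that would at least narrow down where the problem lives.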