Cluster Hyper-V Environment Failure

Question

We have a two host fiber (NIC Team) and copper (Nic Team2) environment. The hosts are clustered and use 2012-R2 Standard (updated) with Hyper-V Clustering and storage pools. The VMs are around 50 Debian machines equally distributed. The networks are three subnets, Cluster, Switch 0, Switch 1. Two are cluster and client, one is cluster only.

Every once in awhile, the entire environment crashes. The most noticeable signs are CPU on VMs jump to 100% and both the network access to machines physical and virtual is unusable. The only way to combat this is a hard shutdown of both hosts which when done, return to normal.

Now here is what I think I know from crawling through logs and viewing our aggregate logging and performance data (note: not every message applies to each incident, this is an aggregate):

Windows:

-TCP ports run out / TCP local endpoint same as remote, reusing local ports - Event ID 4227

-I/O access redirected over network - EventCode=5121

-Cluster Shared Volume is paused - EventCode=5121

-TCP local endpoint same as remote, reusing local ports - Event ID 4227

-Ephemeral port exhaustion - Event ID 4231

Linux:

-High CPU in TOP - ksoftirq

My interpretation: There is a leak either on the host or vm side that eats all the TCP ports and causes a backup of VMQs. This creates a backlog on the environment eventually causing a crash.

My issue: How do I determine what causes the issue exactly? Are there ways to mitigate the issue without knowing the specifics?

This may be related to the iSCSI being broken in updates. Will know more later. — user206106, Jun 26 '17 at 13:58

score 1 · Answer 1 · answered Jun 29 '17 at 09:52

Since the Teaming functionality does not have any built i load balancing functionality that would equally balance the load between the teamed NICs the issue may be based on the NIC teaming aspect of the configuration, have you tried removing the teams for testing purposes?

score 0 · Answer 2 · answered Jun 22 '17 at 22:10

Not a direct answer but some general advice

Most of the problems we faced were resolved by installing the hotfixes published by MS. So many were there that they dedicated pages to list them, and I don't think they bothered to roll them all in to updates:

Hyper-V 2012 R2 and related hotfixes (also links to other relevant lists, e.g HNV, clusters)

There is a script someone published that will install most of them. I think it's this one.

Further to this. If you suspect it's VMQ related, have you tried adjusting the configuration or switching them off at the VM level?

Guidance for correct configuration of VMQ

The pause states we saw were also caused by two things. Slow storage performance, and oversized LUNs. The latter was only a problem when we had too many active VSS snapshots during the backup window - probably not relevant in this case. Check the cluster diagnostic log for more info on the auto pause event or look up the (e.g) c000026e status/reason code on the web.

CSV troubleshooting

Other than that... Driver and firmware updates on the NIC and storage devices.

Cluster Hyper-V Environment Failure

2 Answers2