
We've been running into an issue recently with one of our server stacks. Our two 2008 R2 servers run in a cluster set up to live migrate VMs between each other if a fault is ever detected.

The servers are identical hardware-wise; they were ordered specifically for this purpose. Live migration had been working fine until a couple of months ago, when we noticed that VIR001 could not migrate to VIR002. I've looked into this issue and I know it is generally caused by improperly named resources, but that doesn't seem to be the case here.

VIR002 will live migrate any of its hosted VMs over to VIR001, but VIR001 will not live migrate any VMs over to VIR002. I'm not sure where to start with this. I've noticed a couple of Time-Server errors on VIR001, but if the issue were due to a sync problem, wouldn't both servers experience it?
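(For what it's worth, a quick way to compare time sync on both nodes is the built-in w32tm tool; this just reports each node's current time source and offset:)

```
# Run on both VIR001 and VIR002 and compare the reported source and offset
w32tm /query /status
```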

Right now I'm looking for ideas on what to check. Thanks,

(Update: I've run the Failover Cluster Validation tool and it found no issues. I could not run the Disk validation as the cluster is still online. Both servers in question are also set as possible owners for cluster resources.)
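For reference, the same information can be pulled from PowerShell with something like the sketch below (FailoverClusters module; the VM resource name is just a placeholder):

```
# Sketch of the checks above using the FailoverClusters PowerShell module
Import-Module FailoverClusters

# Both nodes should be Up
Get-ClusterNode | Format-Table Name, State

# Cluster networks: name, state, role and subnet
Get-ClusterNetwork | Format-Table Name, State, Role, Address, AddressMask

# Which physical NICs landed in which cluster network
Get-ClusterNetworkInterface | Format-Table Node, Network, Name, Address

# Possible owners for a VM resource (placeholder name)
Get-ClusterOwnerNode -Resource "Virtual Machine SomeVM"
```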

Insomnia
  • Have you checked the basics for the VM, like the Possible Owners setting? – joeqwerty Jan 25 '13 at 21:50
  • Do you mean the Preferred Owners setting? Preferred owners doesn't seem to make a difference. I have some VMs set up with no preferred owners, some with both servers listed, and some with individual owners. – Insomnia Jan 25 '13 at 22:40
  • Also, have you run the cluster validation wizard since this problem started occurring? – joeqwerty Jan 25 '13 at 22:40
  • No, not the Preferred Owners. The Possible Owners under the Advanced Policies tab of the resource properties (under the Services and applications node). If VIR002 isn't selected as a Possible Owner then those resources (the virtual machines) will never fail over to VIR002. – joeqwerty Jan 25 '13 at 22:43
  • Also, going through the Cluster Manager, I noticed the errors I've been getting again: EventID 1127 - Microsoft-Windows-Failover-Clustering, "Cluster interface Local Area Connection 2 failed", etc. These failing NICs were grouped together in weird ways; the manager seemed to randomly decide which NICs should talk to each other. I had disabled the cluster networks that weren't required or correct. The networks that should be used for migration are still enabled and listed without error. Could this be part of the issue? – Insomnia Jan 25 '13 at 22:45
  • The Preferred Owners setting designates which hosts are the preferred owners of the clustered resource, but a clustered resource may still fail over to a non-preferred owner if no preferred owner is available. A Possible Owner is a host which is allowed to host the clustered resource. If a host is not listed as a Possible Owner for the resource, then that resource will never be allowed to fail over to that host. – joeqwerty Jan 25 '13 at 22:46
  • Gotcha. Just to confirm, my cluster HVCluster1 has both servers selected as possible owners. Same with the Cluster Disk 1. – Insomnia Jan 25 '13 at 23:35
  • I'll run the Cluster validation tool again. From memory, I believe it failed due to the incorrectly matched Clustering Networks. It was one of the reasons why I disabled some of them. – Insomnia Jan 25 '13 at 23:38
  • The Cluster Validation tool just passed with a couple of warnings: Network and Disk. The Disk tests didn't run because the storage is online. Hmm, I just noticed that the "Microsoft Failover Cluster Virtual Adapter" on both servers has an APIPA address. Same subnet, though. The other warnings complain that I disabled some adapters, or that multiple NICs are on the same subnet. – Insomnia Jan 25 '13 at 23:54

1 Answer


Well, I finally found the issue:

I noticed that some of the created cluster networks were not legitimate (i.e., they contained only one NIC, or were teamed with a NIC on a different subnet). I had disabled these. I was told by my colleagues that the binding order on the physical servers could make a difference, so I changed that. I validated the cluster, made sure all the clustered resources had both servers listed as Possible Owners, and, to top it off, I found the "Network for Live Migration" tab under the properties of the Virtual Machine resource.
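For anyone else hitting this, something along these lines should surface those bogus networks from PowerShell instead of clicking through the GUI (the group name is a placeholder):

```
# Sketch: spot cluster networks with only a single member NIC, and confirm
# a VM group can land on both nodes. Names are placeholders.
Import-Module FailoverClusters

# Count member interfaces per cluster network; a network with one NIC is suspect
Get-ClusterNetworkInterface |
    Group-Object -Property Network |
    Format-Table Name, Count

# Possible owners for a clustered VM group (placeholder name)
Get-ClusterGroup "SomeVM" | Get-ClusterOwnerNode
```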

I had ordered the cluster networks in "Network for Live Migration" so that the Live Migration cluster network was first, followed by all the active networks, with the disabled networks at the bottom. No love. Today, after changing the binding order and seeing no change, I decided to deselect every cluster network in the Live Migration tab except the three internal networks (Live Migration, host, and Cluster Domain). Now it's working.

I'm not sure what caused this to begin with. We haven't made any physical changes to the hardware in the last year, and this was working at least four months ago. It looks like Failover Cluster Manager doesn't always respect its own settings.

Thanks for the replies on this question.

Insomnia