I've set up a 2-node vSphere cluster. Each node is equipped with 4x 1 GbE NICs. I have set up a single vSwitch on each node, using all 4 vmnics as uplinks and the following port groups:

Management: VMkernel port -> Active on vmnic0, Standby on vmnic1-2-3

vMotion and FT -> Active on vmnic3, Standby on vmnic0-1-2

Workload -> Active on vmnic0-1-2-3

Load balancing: Route based on originating virtual port.
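For reference, the failover order was set roughly like this via esxcli (a sketch; port group names are illustrative):

    # Sketch of the esxcli configuration; port group names are illustrative
    esxcli network vswitch standard policy failover set \
        --vswitch-name=vSwitch0 --load-balancing=portid
    esxcli network vswitch standard portgroup policy failover set \
        --portgroup-name="Management Network" \
        --active-uplinks=vmnic0 --standby-uplinks=vmnic1,vmnic2,vmnic3
    esxcli network vswitch standard portgroup policy failover set \
        --portgroup-name="vMotion-FT" \
        --active-uplinks=vmnic3 --standby-uplinks=vmnic0,vmnic1,vmnic2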

I know this setup is not ideal, as best practices suggest a physically separate network for vMotion/FT, but bear with me.

I noticed that no VM is mapped to vmnic3, so it appears to be used by vMotion/FT only.

However, when FT is enabled (on a dummy Windows Server machine doing nothing), I notice the following issues:

1) Pings to/from that machine become unstable (latency jumps up to 5 ms).

2) Capturing stats on the physical switch (as sketched below), I see that the port connected to the vMotion/FT NIC has an input rate of ~300 Mbps (which is expected), but the output rate of all ports connected to the other vmnics also rises to ~300 Mbps, as if the physical switch were flooding the FT traffic out of all other ports. When I disable FT, traffic on all NICs drops back to small values.
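This is roughly how I captured the rates (Cisco-IOS-style commands; interface names are examples from my setup):

    ! Illustrative rate check on the physical switch
    ! Port facing vmnic3 (the vMotion/FT uplink):
    show interfaces GigabitEthernet1/0/4 | include rate
    ! A port facing one of the other vmnics:
    show interfaces GigabitEthernet1/0/1 | include rate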

What explains points 1 and 2 above?

EDIT: All ports are in the same VLAN. I know this is far from ideal, but it still doesn't explain points 1 and 2 above.

kuma

2 Answers


This traffic is called 'unicast flooding', and it happens when the switch has no entry for the destination MAC address in its MAC address table (for example, because the receiving VMkernel port rarely transmits, so its entry ages out), so it forwards each frame out of every port in the VLAN.
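You can usually confirm this on the switch while FT is running: if the destination MAC of the FT traffic is missing from the table, every frame to it is flooded. A Cisco-IOS-style check (commands and the MAC are illustrative):

    ! Look for the FT VMkernel port's MAC in the table; a missing entry
    ! means frames to that MAC are flooded to every port in the VLAN
    show mac address-table dynamic vlan 1
    show mac address-table address 0050.56xx.xxxx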

There is a known problem with unicast flooding occurring when vMotion ports are not isolated in their own VLAN. VMware are not as clear about this as they should be; there is a good blog post on it here: http://virtuallyhyper.com/2012/03/vmotion-causes-unicast-flooding/.

Your assignment of NICs to services is fine, but you must use a dedicated VLAN for vMotion/FT traffic. Your switch ports would need to be trunk ports to accommodate this, as sketched below.
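A minimal sketch of what that could look like, assuming VLAN 42 for vMotion/FT (the VLAN IDs and port group name are examples, not from your setup):

    ! Switch side (Cisco-IOS-style): trunk the ESXi-facing ports
    interface range GigabitEthernet1/0/1 - 4
     switchport mode trunk
     switchport trunk native vlan 1
     switchport trunk allowed vlan 1,42

    # ESXi side: tag the vMotion/FT port group with the dedicated VLAN
    esxcli network vswitch standard portgroup set \
        --portgroup-name="vMotion-FT" --vlan-id=42

The untagged port groups keep working over the native VLAN; only the vMotion/FT traffic gets isolated into VLAN 42.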

Westie
  • Interesting. However, I don't experience flooding during vMotion (manually migrating a VM), only when Fault Tolerance is on. Is this expected? – kuma Jun 07 '19 at 10:05

Not enough bandwidth for Fault Tolerance. VMware recommends at least 10 Gb/s links. In the lab, even a 2-vCPU VM can need about 2.5 Gb/s just for FT in some scenarios; that is what it takes to replicate the change rate of its RAM.
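As a rough sanity check, assuming the guest dirties about 300 MB of RAM per second: FT has to ship roughly 300 MB/s × 8 ≈ 2.4 Gb/s over the logging network just to keep the secondary in sync, before any protocol overhead.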

Re-examine your recovery time requirements. Non-FT VMware HA can boot a VM on a different host with a couple minutes of downtime.

If you are serious about FT, consider dedicating 3 or 4 of those 1 Gb links to FT, or upgrade to something like 25 Gb Ethernet.

John Mahowald
  • How could this explain the flooding? – kuma Jun 07 '19 at 10:05
  • Bandwidth alone does not explain the flooding. Probably the trouble is introduced by sharing switch ports for different purposes. But you will want to dedicate more than 1 Gb to FT when you fix that. – John Mahowald Jun 07 '19 at 13:15