ESXi 5x cluster Hardware Failure Scenario

Question

Hello fellow engineers.

I have an ESXi 5.0 cluster set up with 3 ESXi hosts. Now I need to create a test case for networking hardware failure and perform the test in the datacenter.

My Setup:

    1) 3 Dell R820 servers (all identical in configuration and hardware)

    2) PHYSICAL: Pair of 1Gb ports for the vSphere Management Network (active/standby)
       VIRTUAL: 1 VMkernel port vmk0 on standard vSwitch0

    3) PHYSICAL: Pair of 10Gb ports for regular network communication between guests MESH (active/active using IP hash load balancing, connected to the redundant switches)
       VIRTUAL: dvSwitch0 with the needed VLANs exposed

    4) PHYSICAL: Pair of 10Gb ports for storage NFS/VMDK (active/passive, failover only with "Link Status Only" network failure detection, connected to different switches)
       VIRTUAL: 1 VMkernel port vmk1 connected to distributed switch dvSwitch01

    5) PHYSICAL: Pair of 10Gb ports for storage (guest initiated) (active/active, load balancing based on port ID with "Link Status Only" network failure detection, connected to different switches)

HA and DRS enabled.

I was planning to just do a regular cable-pull test, but I might be missing some factors. I would appreciate any suggestions and/or best practices for performing such a test.

score 4 · Accepted Answer · answered Aug 28 '13 at 15:30

- Power off a host. - To test high-availability and admission control.

- Power off a switch. - To test failover links.

- Disconnect data and storage network cables independently. - To test resiliency, load balancing and datastore heartbeat/host isolation state. Also storage controller failover.
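To make the "independently" part concrete, the cable-pull cases can be enumerated per uplink pair so none gets skipped on the day. This is only a rough sketch: the pair descriptions come from the question, while the uplink labels (`mgmt-a`, `nfs-b`, and so on) are made-up placeholders, not real vmnic names.

```python
from itertools import combinations

# Hypothetical labels for the four uplink pairs described in the question.
UPLINK_PAIRS = {
    "management (active/standby)": ("mgmt-a", "mgmt-b"),
    "guest traffic (active/active, IP hash)": ("guest-a", "guest-b"),
    "NFS storage (active/passive)": ("nfs-a", "nfs-b"),
    "guest-initiated storage (active/active, port ID)": ("gstor-a", "gstor-b"),
}


def cable_pull_cases(pairs):
    """Return (network, uplinks_to_pull) cases: each single uplink, then both."""
    cases = []
    for network, uplinks in pairs.items():
        for count in (1, 2):
            for combo in combinations(uplinks, count):
                cases.append((network, combo))
    return cases


# Print a checklist: three cases per pair (A only, B only, A and B together).
for network, pulled in cable_pull_cases(UPLINK_PAIRS):
    print(f"{network}: pull {', '.join(pulled)}")
```

Pulling both uplinks of the management pair is the case that should exercise host isolation, so it is worth deciding beforehand what the expected HA response is.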

score 1 · Answer 2 · answered Aug 28 '13 at 15:30

When we test failure scenarios we start by removing individual wires/fibres, then whole NICs/HBAs, then servers, then switches - i.e. small to large - simply because if the platform can't handle the small then testing on the large will be pointless.
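As a rough illustration of that small-to-large ordering, here is a minimal Python sketch that sorts a test list by failure scope and stops at the first failure. The scope names, the sample tests, and the `check` callback are all placeholders; in practice `check` would be a person confirming the platform stayed healthy.

```python
# Scope ranks follow the order above: cable < NIC/HBA < server < switch.
SCOPE_ORDER = ["cable", "nic_or_hba", "server", "switch"]


def order_small_to_large(tests):
    """Sort (scope, description) tuples by failure scope; ties keep their order."""
    rank = {scope: i for i, scope in enumerate(SCOPE_ORDER)}
    return sorted(tests, key=lambda test: rank[test[0]])


def run_plan(tests, check):
    """Run tests small to large; stop at the first one the platform fails.

    `check` is a hypothetical callback returning True if the platform survived.
    Returns (completed_tests, failed_test_or_None).
    """
    completed = []
    for test in order_small_to_large(tests):
        if not check(*test):
            return completed, test
        completed.append(test)
    return completed, None


# Example plan, deliberately listed out of order.
plan = [
    ("switch", "power off one switch"),
    ("cable", "pull one NFS uplink"),
    ("server", "power off one host"),
    ("nic_or_hba", "disable one 10Gb NIC"),
]

for scope, description in order_small_to_large(plan):
    print(f"{scope}: {description}")
```

Stopping at the first failure reflects the reasoning above: a larger test tells you little until the smaller one passes.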

That said I can't see any issues with your setup, not as you've explained it anyway.

score 0 · Answer 3 · answered Aug 28 '13 at 17:27

I lean more toward the big approach: unplug a server, then on the next try the storage, and on the last one a switch (or in any other order). If the system survives that, all good. But if you do have a lot of time (and someone to pay for it), you can try each small problem on its own...

Tsg
