Odd one-way ping issue that I can't wrap my head around

Question

Long time lurker, but today I encountered an odd problem that will bug me until resolution :-)

It seems to be presenting as a one-way ping issue from one server to a failover cluster.

All machines are running Windows Server 2008 R2 with IPv6 disabled. The Windows Firewall service is disabled.

Lay of the land:

Report Server - VMware virtual machine using an E1000 NIC. Nothing special: IP, subnet, gateway and routing table all appear sane.

SQL 2008 R2 Active/Passive failover cluster - Each node has 7 configured NICs: 3 for iSCSI, and the remaining 4 teamed into 2 IPs with BACS. One NIC team is used for local traffic and the other as part of the failover cluster. The failover cluster has a VIP.

Problem:

All was working fine last week. All machines are on the same subnet. Today, the report server couldn't ping the VIP of the failover cluster. It could ping both nodes without issue, using both non-storage IP addresses.

The SQL failover cluster could ping the report server without any issue.

I can ping the SQL VIP from any other machine, which in my mind rules out the VIP itself.

The Band-Aid

I tried rebooting the report server in case TCP/IP was misbehaving, to no avail. What ended up working was changing the report server's IP address. As far as I know, there are no host rules in place on the switch (Catalyst 3750).

What could cause this? I'd assume the ARP table was cleared when the report server rebooted, and the IP address shouldn't have become stale on the DB cluster... looking for someone with more networking know-how than I have :-)

Do you have access to the 3750? Is the software fairly recent? — rnxrx, Jul 24 '12 at 01:25
Have access to the switch, it's running 12.2(53)SE2. Not terribly up to date, but not ancient. Other servers on this switch have been unaffected. — WinAdmin, Jul 24 '12 at 01:31
And no changes networking-wise.. that I'm aware of - we have a dedicated networking guy, but he usually at least informs me prior to making changes.. — WinAdmin, Jul 24 '12 at 01:37
Does it get an ARP response when it attempts to communicate with the cluster's VIP? — Shane Madden, Jul 24 '12 at 01:45
I don't recall. At this point, I've polluted the test by both rebooting and changing IPs. Later tonight, I can switch the machine back to its old IP and test. What would be the best test for arp response in a Windows environment? — WinAdmin, Jul 24 '12 at 02:09

score 1 · Answer 1 · answered Jul 24 '12 at 04:34

Facepalm.

I know what caused it, although I may need help with the explanation. While troubleshooting tonight, I spun up another server and had it assume the report server's old IP address. This brand new server running Windows Server 2008 R2 could NOT ping the VIP either.

Well, that's strange. And again, it could ping either of the nodes by name. I looked at the ARP tables, and they seemed sane. I hopped on the active DB node to check the MAC address and noticed that the checkbox for IPv6 was ticked. I unchecked it, and that instantly resolved the problem.
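For anyone repeating the ARP check described above, here is a rough Python sketch that parses the text printed by Windows `arp -a` and looks up a single address. The function names and the sample addresses (taken from the 192.0.2.0/24 documentation range) are made up for illustration and are not from my environment. The idea would be to ping the VIP first and then compare the MAC listed for it against the active node's adapter; treat it as a starting point, not a definitive test.

```python
import re
import subprocess

# Matches data rows of Windows `arp -a` output, e.g.
#   192.0.2.50            00-15-5d-00-00-01     dynamic
ARP_LINE = re.compile(
    r"^\s*(\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"((?:[0-9a-fA-F]{2}-){5}[0-9a-fA-F]{2})\s+(\w+)\s*$"
)


def parse_arp_table(text):
    """Return {ip: (mac, entry_type)} parsed from `arp -a` text.

    If the same IP is listed under several interfaces, the last one wins.
    """
    entries = {}
    for line in text.splitlines():
        match = ARP_LINE.match(line)
        if match:
            ip, mac, entry_type = match.groups()
            entries[ip] = (mac.lower(), entry_type)
    return entries


def lookup(text, ip):
    """Return (mac, entry_type) for ip, or None if there is no entry."""
    return parse_arp_table(text).get(ip)


def read_arp_table():
    """Run `arp -a` on the local machine and return its output as text."""
    result = subprocess.run(
        ["arp", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample output, used only to demonstrate the parser.
SAMPLE = """
Interface: 192.0.2.10 --- 0xb
  Internet Address      Physical Address      Type
  192.0.2.1             00-1B-54-AA-BB-CC     dynamic
  192.0.2.50            00-15-5D-00-00-01     dynamic
  192.0.2.255           ff-ff-ff-ff-ff-ff     static
"""

if __name__ == "__main__":
    # 192.0.2.50 stands in for the cluster VIP in this sample.
    print(lookup(SAMPLE, "192.0.2.50"))
```

On a real machine you would pass `read_arp_table()` instead of `SAMPLE`. No entry for the VIP right after a ping attempt would suggest that no ARP reply came back.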

The question becomes: why? I missed the IPv6 setting when configuring the cluster, that's for sure... but this cluster has been in production for 3+ months with no apparent issues before today. This node has been the active node for more than 3 weeks.

Does anyone have experience or an explanation of how something so good became so bad? :-)
