1

Two boxes with identical loads serving the same sites tend to slow down and stop responding to ping. The slow (or intermittent) ping response causes our load balancer to conclude the servers are offline and disable them. A third server with identical content does not have the issue, so I'm fairly confident it's not the sites.

The OS is Windows Server 2008. The configuration is a little unusual: since we're using the Barracuda Networks load balancer in Direct Server Return mode, we've had to configure a number of loopback adapters that "fake" the IP, as described here. The physical adapter has forwarding enabled, which Server 2008 requires for the loopback adapters to function.
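For reference, the Server 2008 loopback/DSR setup described above is typically done with netsh. This is a sketch with placeholder adapter names, not our exact configuration:

```shell
:: Placeholder adapter names -- substitute the real ones shown by
:: "netsh interface show interface".
:: Enable forwarding on the physical NIC so it will pass traffic
:: destined for the loopback adapter's VIP.
netsh interface ipv4 set interface "Local Area Connection" forwarding=enabled
:: The loopback adapter holds the load balancer's virtual IP; the weak
:: host settings let it receive and reply via the physical adapter.
netsh interface ipv4 set interface "Loopback VIP" weakhostreceive=enabled
netsh interface ipv4 set interface "Loopback VIP" weakhostsend=enabled
```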

Symptoms:

  • When it occurs, ping usually either times out, or drops packets.
  • Fixes seem to be one or more of the following:
    • Logging in via remote desktop.
    • Clearing the DNS cache or the ARP cache (not sure which).
    • Restarting.
  • After one or more of the above, the server seems fine for about 4 hours before acting up again.
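For reference, the cache-clearing fixes above correspond to these standard commands, run from an elevated prompt:

```shell
:: Flush the local DNS resolver cache
ipconfig /flushdns
:: Clear the server's ARP cache
netsh interface ip delete arpcache
```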

Question:

What possible reasons are there for this? What should I try to diagnose this? I haven't ruled anything out. Switch configuration, domain/dns server, all ideas are welcome.

Sadly, I have very little knowledge of good network administration, so obvious answers are welcome too.

EDIT:

In answer to some of the questions posed.

I have contacted Barracuda and they seem to be of the opinion that the problem is related to the network. I think I agree at this point.

The IP is assigned to a physical interface, not shared between servers. Pinging is done from within the same subnet.

The third box handles all the site load when the other two go down and hasn't had much problem with it, but occasionally it too has trouble. I haven't found a pattern with that one yet.

This evening I sat down with another (more experienced) network guy to look through some of the domain and server configurations. One of the things he found was a bad DNS setup on the domain controllers: they were configured with external DNS servers as their alternates rather than the other DC. We switched them to reference each other for DNS and added forwarding to the DNS service. We also removed external DNS references from all the web servers.
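For the record, the forwarding change on the DNS service can be scripted with dnscmd. This is a sketch only; the resolver addresses below are placeholders, not the actual forwarders we configured:

```shell
:: Replace the DNS server's forwarder list (placeholder addresses).
dnscmd /ResetForwarders 192.0.2.53 198.51.100.53
```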

EDIT 2:

With Wireshark I was able to examine the ICMP traffic during one period of down time. I began this test because I could not reach a shared folder on box 2 from box 1.

Test:

  1. Start capturing traffic on box 2.
  2. Observed that box 2 was seeing and replying to pings from the Barracuda Load Balancer.
  3. Logged into box 1 and pinged box 2.
  4. Observed that box 2 saw but DID NOT reply to pings from box 1.
  5. Observed that box 2 saw but DID NOT reply to pings from the LB for a period of 100 seconds after the first ping from box 1.

So somehow traffic between the two boxes is causing box 2 to crap out on ICMP for a period of time.
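To quantify that silent window from a capture, one option is to export the timestamps of the ICMP replies and look for gaps. A minimal sketch of that check (plain Python over a list of timestamps; the 100-second threshold mirrors the observation above, and the numbers in the example are made up):

```python
# Given the timestamps (in seconds) of ICMP echo replies seen in a
# trace, report windows where no reply was seen for longer than a
# threshold. This is a toy analysis, not Wireshark's own API.

def silence_windows(reply_times, threshold=100.0):
    """Return (start, end) pairs bounding gaps longer than `threshold`."""
    gaps = []
    for earlier, later in zip(reply_times, reply_times[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later))
    return gaps

# Example with made-up timestamps: replies at 0, 1, and 2 s, then
# silence until 105 s -- one gap of 103 s is reported.
silence_windows([0, 1, 2, 105, 106])
```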

I should note that box 1 was working fine throughout this test, but did not see any requests from box 2. While pinging box 1 from box 2, Wireshark on box 2 showed a message "Destination unreachable (Communication administratively filtered)" from a source IP I did not recognize.

Joel

7 Answers

3

Do you need to use ICMP ping for your server testing? HTTP requests are supported by most load balancers, and are usually a better idea, as your web server can be down while your network card is still up.

Tim Howland
  • Even though this doesn't address the actual problem, it's a good point. Load balancing health checks ideally match the type of traffic they're balancing. This reduces the risk of false positives/negatives. – sh-beta May 14 '09 at 03:51
  • Good point. Not relevant to this problem, but definitely something I will look into when I get our current setup resolved. – Joel May 14 '09 at 16:06
1

I would check with Barracuda Networks first. This may be a known issue. We had a similar problem that turned out to be our Cisco load balancer. A firmware update fixed the issue.

Joseph
  • Have checked with Barracuda but they are of the opinion (and I tend to agree) that it is a network issue. – Joel May 14 '09 at 01:43
1

Is the third server under load, or is it unique from the other two in another way?

Without knowing more, I'd suggest getting Wireshark onto these servers while pinging them and taking a look at the ICMP activity. My (possibly unfounded) suspicion is that these servers are having ARP trouble: they are sending response packets back, but you're never receiving them.

With Wireshark, set your filter to "arp or icmp" and see what it brings up. You should also take a quick look at your System event logs - there might be something obvious in there that short-cuts any further guesswork.

If you're not familiar with arp, it's the protocol for translating layer 3 (IP) addresses to layer 2 (MAC) addresses. This has to happen correctly or the layer 2 frame containing the layer 3 packet will either never be sent, or will arrive at the wrong destination.
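As a toy illustration of that mapping (all addresses made up), an ARP cache is essentially an IP-to-MAC dictionary, and a missing or stale entry means the frame can't be addressed:

```python
# Toy model of the IP-to-MAC resolution described above. A frame is
# only deliverable when the ARP cache maps the destination IP to the
# right MAC; the addresses below are invented for illustration.

def resolve_next_hop(arp_cache, ip):
    """Return the MAC for `ip`, or None if ARP has no entry for it."""
    return arp_cache.get(ip)

cache = {"10.0.0.2": "00:1a:2b:3c:4d:5e"}
resolve_next_hop(cache, "10.0.0.2")  # hit: frame can be addressed
resolve_next_hop(cache, "10.0.0.9")  # miss: the frame is never sent
```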

Finally, the other posters' duplexing/speed recommendations are solid best practice, though I doubt they're the root cause here. Note that with gigabit Ethernet, autonegotiation is far more reliable, so it's no longer something you need to worry about.

EDIT

The DNS changes you made are a good idea for sure, but I have a hard time imagining a scenario where that would lead to ICMP timeouts. Possibly the app is blocking on thousands of DNS queries and eating up its resources so much that it can't respond to ICMP?

Anyway, if this doesn't resolve the issue the packet traces should show more of what's going on.

sh-beta
  • Please see my edits for answers to your questions. – Joel May 14 '09 at 01:42
  • Updated my post accordingly. – sh-beta May 14 '09 at 03:47
  • Right now things are quiet. No problems since we made the DNS changes. I've installed Wireshark and will start diagnosing with it tomorrow if I'm able to reproduce the problem. I'm slightly familiar with ARP (at least the concept), but not with how it works in practice. The Barracuda rep I spoke with said that issues like this often clear up after clearing the ARP cache on the switch. – Joel May 14 '09 at 16:05
  • I was able to capture some traffic today. I have edited my question with the new data. – Joel May 18 '09 at 18:03
0

What was the source IP that did the administrative filtering? Most likely that is the source of the problem, and I would suspect it is internal to the load balancer.

Kevin
0

One thing that I've found helps is to ensure the NIC on the server and the switch port it is connected to are both set to the same speed and duplex settings. I've had trouble with "auto negotiate" not negotiating very well, which starts causing a lot of errors on both the port and the NIC.

palehorse
0

Try to set your interfaces to a speed manually, and avoid using auto-negotiate when possible.

WerkkreW
0

Update the network drivers on your servers to the latest version provided by your hardware vendor. I find this sometimes fixes weird network issues.

rmwetmore