
I am remotely troubleshooting network problems for a client. They occasionally get "connection timed out" between a web server and a back-end search server, and they can reproduce this behavior easily using a staging server. I asked them to run Wireshark on both servers, and what I find is that SYN packets are sent over and over again, and often they are not visible on the receiving end. I am wondering what you guys think the reason could be?

My first thought was the firewall that sits between the servers. Now they say they have connected the back-end search server to the same network as the web server, which puzzles me.

More details: I assume the servers are Windows Server 2008. I have never been to the client's location. The web server is using WCF with Transport Security turned on to access the back-end servers. They seem to be able to rule out heavy load, as these problems also appear at light load.

To me it sounded obvious that something in the network must be causing the SYNs to not appear at the destination, but now they say they have turned off the firewall rules, turned off the Windows firewalls and even put the servers on the same network. And I'm clueless.

Update: The latest test they've made is to run a console app (simulating repeated web requests) on a server on the same subnet as the search server. Both servers run as VMware instances.

Ideas?

LinusK

2 Answers


Are you saying that both the web server and the search server are on the same subnet? Attacking this from a network point of view, use IP addresses only while troubleshooting to rule out any shenanigans with incorrect DNS entries, etc.

To help my sanity I'm going to say the web server has IP w.w.w.w and the search server has IP s.s.s.s.

The fact that Wireshark does not see the SYN arrive rules out the host firewall on the receiving server: Wireshark should be observing the packets as they arrive on the interface, before the firewall gets a chance to do anything with them.
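To pin down exactly which SYNs go missing, the two captures can be compared side by side. Here is a minimal sketch, assuming the SYN packets from each capture have already been exported as (source port, sequence number) pairs using absolute rather than relative sequence numbers. The function name and the input format are my own illustration, not something Wireshark produces directly.

```python
def missing_syns(sent, received):
    """Return SYNs seen in the sender's capture but absent from the receiver's.

    Each SYN is a (source_port, sequence_number) tuple. Retransmissions of
    the same SYN reuse the same pair, so they are reported only once.
    """
    received_keys = set(received)
    already_seen = set()
    missing = []
    for key in sent:
        if key not in received_keys and key not in already_seen:
            missing.append(key)
        already_seen.add(key)
    return missing


# Hypothetical data: port 50001 was retransmitted once and never arrived.
sent = [(50001, 111), (50001, 111), (50002, 222), (50003, 333)]
received = [(50002, 222)]
print(missing_syns(sent, received))
```

If every retransmission of a given SYN is missing on the receiving side, the loss is happening somewhere on the path rather than on the receiving host.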

The first thing I would do is check the entry in the ARP cache on the web server for s.s.s.s. On most platforms this is just arp -an on the command line (arp -a on Windows). Then I would check that the MAC address of the search server matches this. If it does not, it is likely that another device on the network has the same IP as the search server and they are fighting for it.
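If the ARP check has to be repeated many times while the problem is being reproduced, it can be scripted. This is a rough sketch, assuming the ARP table text has been saved from the command above. The function names are hypothetical, and the parser only looks for an exact IP token and a MAC-shaped string on the same line, which covers the usual Windows and Linux layouts but is not a complete parser.

```python
import re

# Matches MACs written with either ':' or '-' separators.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b")


def normalize_mac(mac):
    """Lower-case a MAC and use ':' separators so formats compare equal."""
    return mac.lower().replace("-", ":")


def macs_for_ip(arp_output, ip):
    """Return the set of MACs listed for exactly `ip` in ARP table text."""
    found = set()
    for line in arp_output.splitlines():
        # Split on whitespace and parentheses so "(10.0.0.5)" and
        # "10.0.0.5" both yield the bare address as a token.
        tokens = re.split(r"[\s()]+", line.strip())
        if ip in tokens:
            match = MAC_RE.search(line)
            if match:
                found.add(normalize_mac(match.group(0)))
    return found


def arp_matches(arp_output, ip, expected_mac):
    """True if the table lists `ip` with exactly the expected MAC."""
    return macs_for_ip(arp_output, ip) == {normalize_mac(expected_mac)}
```

Running this periodically and logging whenever `arp_matches` flips to false would show whether the entry for s.s.s.s changes around the time the timeouts occur.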

Another angle would be to set a continuous ping going between the servers to see if it reveals any packet loss. That might imply a cable problem, or duplex mismatch, but from your description it doesn't seem that likely. Is it possible to get on the switch and check the interfaces for errors? Presumably if they are virtual this would affect all servers on the same VHost... so again, it doesn't seem likely.
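Since the symptom is lost SYNs rather than lost ICMP, a similar continuous test can be done with plain TCP connects instead of ping. The following is only a sketch of that substitute technique: the function name, the default values and the result categories are my own choices, and the host and port would have to be the search server's address and its listening port.

```python
import socket
import time


def probe(host, port, attempts=10, timeout=3.0, interval=0.0):
    """Attempt `attempts` TCP connects to host:port and tally the outcomes."""
    results = {"ok": 0, "timeout": 0, "error": 0}
    for _ in range(attempts):
        try:
            # A completed three-way handshake counts as success; the
            # connection is closed again immediately.
            with socket.create_connection((host, port), timeout=timeout):
                results["ok"] += 1
        except socket.timeout:
            # No SYN-ACK arrived within `timeout` seconds.
            results["timeout"] += 1
        except OSError:
            # Refused, unreachable, or another socket-level failure.
            results["error"] += 1
        time.sleep(interval)
    return results
```

For example, `probe("s.s.s.s", 443, attempts=1000, interval=0.5)` with the real address substituted would give a rough timeout rate that can be compared between hosts on different subnets or different VM hosts.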

Perhaps the VHosts have some kind of interface bonding set up that isn't quite working right? I have seen instances where a misconfigured switch port on the end of one of six ESX interfaces caused some interesting side effects.

A more complicated scenario might be that there is a 'bump in the wire' device between the two servers, perhaps a layer 2 load balancer, a layer 2 firewall, or an IPS of some description. Any of these devices has the potential to block frames between the servers. I would hope that your client would have mentioned this though!

paulos
  • Interesting point about the ARP cache. I will check that out, although it might take some time before I am able to. I also learnt just now that they are running both machines as VMware instances. Not sure though if it's on the same host or not. Will investigate that further. – LinusK Oct 05 '11 at 11:33

Possible reasons:

1) rate-based filtering on switches/routers

2) frame drops due to bad cable/NIC or congestion

gelraen