
I'm seeing occasional (2 or 3 times/day) "connection drops" on a Windows Server 2008 R2 physical server, running on a Dell R710. I use the term "connection drops" because I don't know how to describe it otherwise, but I mean the following:

  • Server stops responding to ping
  • Any RDP connections (or other types of remote connections) will stall and eventually time out
  • Any connections to the SQL Server database or IIS running on this server will stall/time out

This seems to last anywhere from 30 seconds to 1 minute. After that, the server comes back up, responds to ping and just resumes all of its services as if nothing ever happened.

This server runs the following services:

  • SQL Server 2005 database (2 databases and reporting)
  • IIS7 web server (running 2 custom services and 1 reporting site)

Obviously, I'd like to find out what is causing this. There is nothing in the server's event logs or other monitoring parameters that I can see that indicates any issue in particular. Any tips on how to try to narrow down what is causing this issue?

It's worth taking into account the following facts:

  • We have 5 other servers (3 of them R410s) running in the same rack, on the same network, none of which seem to display this issue
  • The handle count shown in Task Manager's performance view sits around 40,000 (of which lsass.exe seems to take ~7,000)
  • I've restarted IIS to see whether the custom services are somehow causing this; if they are, I shouldn't see the issue again for the next couple of days/weeks

Update 1: DRAC is still accessible when this issue occurs. This is a very strange issue. I think we'll have to trial & error this by trying various solutions and checking the results.

Update 2: I have spoken to the network guys, and they confirmed that for some reason our server's MAC address is repeatedly being removed from the switch's ARP table. The exact cause is still unknown (it could be a dodgy cable connecting the server to the switch, or the NIC repeatedly going to sleep). We've started a continuous ping to the default gateway, and are looking to replace the cable.
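For anyone wanting to log that continuous ping rather than watch it, here is a minimal Python sketch. It assumes the stock `ping` command is on the PATH and that a real reply contains a `TTL=` field (true for the English-language Windows and Linux ping output I know of, but treat it as an assumption). The function names `ping_once`, `find_outages` and `monitor` are made up for this example.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=1):
    """Send one ICMP echo request via the system ping; True if a reply arrived."""
    if platform.system() == "Windows":
        # -n = number of echo requests, -w = timeout in milliseconds
        cmd = ["ping", "-n", "1", "-w", str(int(timeout_s * 1000)), host]
    else:
        # -c = number of echo requests, -W = timeout in seconds
        cmd = ["ping", "-c", "1", "-W", str(int(timeout_s)), host]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # Windows ping can exit 0 on "Destination host unreachable",
    # so also require a TTL field, which only shows up in real replies.
    return result.returncode == 0 and "ttl=" in result.stdout.lower()


def find_outages(samples):
    """Collapse (timestamp, ok) samples into (first_failure, last_failure) windows."""
    outages = []
    start = last = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts  # first failed probe of a new outage
            last = ts
        elif start is not None:
            outages.append((start, last))  # a reply came back: close the window
            start = last = None
    if start is not None:
        outages.append((start, last))  # still down when sampling stopped
    return outages


def monitor(host, interval_s=1.0, duration_s=3600):
    """Ping host roughly every interval_s seconds for duration_s; return outage windows."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((datetime.now(), ping_once(host)))
        time.sleep(interval_s)
    return find_outages(samples)
```

Calling something like `monitor("192.0.2.1")` (a placeholder address, substitute your gateway) from the server itself gives a list of start/end timestamps that can be lined up against the switch logs. Running the same thing from another host towards the server shows whether the drop is visible in both directions.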

Skyhawk
pHk
  • Is this server virtual or physical? If you set up logging of ping requests from the server to the outside, do you see the same issue? Any updates to drivers or firmware for the servers? – Nixphoe Jul 19 '11 at 19:29
  • Does this server have iDRAC? Is it enabled? – ITHedgeHog Jul 19 '11 at 19:31
  • @Nixphoe It's a physical machine. I'm sure there are newer updates for the BIOS and/or NIC drivers and things like that, but I would like to be a little more sure about the cause of this before I start taking this production machine down all the time for maintenance that may not solve anything. – pHk Jul 19 '11 at 19:59
  • @ITHedgeHog I believe so, I'd have to check (do they not come with DRAC by default these days?). – pHk Jul 19 '11 at 20:01
  • @pHk: What process is eating up the rest of the 40K handles? That's a bit of a high number. – Evan Anderson Jul 19 '11 at 20:24
  • I guess you've tried this already: You may try to connect the 'faulty' server to another port of your rack switch. I experienced very weird issues with single defective ports on switches. – desasteralex Jul 19 '11 at 23:49
  • @Evan From the process list in taskmgr I can see that lsass.exe is holding 7225 handles, dns.exe 5319 handles and then 1600 for the SQLS followed by various other stuff. Apparently, this is also a DNS server (which is news to me). Sounds like I need to take this up with the original admin of this server. – pHk Jul 20 '11 at 07:03
  • @desasteralex We haven't tried that actually; this is something I will recommend to the network guy when he gets back. I'm just doing some preliminary research into this while he's away on holiday. – pHk Jul 20 '11 at 07:04
  • It sounds like it's a Domain Controller on top of everything else it's doing. That won't cause the problems you're seeing, however it is odd to be using a DC as a database server and web server. – Evan Anderson Jul 20 '11 at 12:26

2 Answers


If you're using multiple NICs on this machine, then make sure that you only have one default gateway defined.

We had a problem like this recently and it transpired that the NICs used for the back-end networks (192.168.x.x) had default gateways specified.
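A quick way to check for this is to count the 0.0.0.0/0 entries in the output of `route print`. The sketch below is only an illustration: it assumes the usual five-column layout of the IPv4 "Active Routes" table (destination, netmask, gateway, interface, metric), and the helper names `default_routes` and `read_route_table` are invented for this example.

```python
import subprocess


def default_routes(route_print_output):
    """Return (gateway, interface) pairs for each 0.0.0.0/0 row in `route print` output."""
    routes = []
    for line in route_print_output.splitlines():
        fields = line.split()
        # Active IPv4 rows have five columns:
        # destination, netmask, gateway, interface, metric.
        # Rows under "Persistent Routes" have four, so they are skipped here.
        if len(fields) == 5 and fields[0] == "0.0.0.0" and fields[1] == "0.0.0.0":
            routes.append((fields[2], fields[3]))
    return routes


def read_route_table():
    """Run `route print -4` (Windows) and return its text output."""
    result = subprocess.run(["route", "print", "-4"], capture_output=True, text=True)
    return result.stdout
```

On the server, `default_routes(read_route_table())` returning more than one pair means more than one interface has a default gateway configured, which is the situation described above.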

Kev
  • How would the default gateway on multiple NICs cause it to drop out occasionally? – pHk Jul 19 '11 at 20:02
  • We had traffic being routed to the wrong LAN. As in the public facing NIC had the border router gateway ip address, and the private facing nic had a gateway address on the private lan. – Kev Jul 19 '11 at 20:04
  • Kev is right, it's a pretty common issue with dual NICs on Server 2008 (x), where the routing information occasionally gets mixed up. If I remember correctly, it has something to do with M$'s handling (or in this instance, incorrect handling) of the Internet Connection Sharing (ICS) feature. – Get-HomeByFiveOClock Jul 10 '14 at 10:43

If you are logged in at the console when this happens, is the server still responsive?

Do a packet capture on the NICs of the affected machine, using Wireshark or Netmon. This will tell you what's happening with the machine's TCP/IP traffic during this time.

Willis