
My company has designed an application for our in-house processes that runs on about 50 virtual machines. It has been running for over 5 years, and at the beginning of the year we set up a new server cluster for our new Microsoft SQL Server 2014 database. Things ran great for about 9 months, but for the last 3 months we have been experiencing a very strange issue.

Randomly one or two of the 50 machines will start seeing the following error.

Unhandled Error: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The wait operation timed out.)

The processes on the affected machine then time out, and usually about 30 to 60 minutes later the machine can connect to the server again as if nothing had happened.

  1. Rebooting the affected machines does not fix the issue; we have to wait until it goes away on its own.
  2. During this time, the affected machines cannot ping the cluster name or cluster IP, while other machines still can.
  3. The affected machines can't telnet to the SQL port, while other machines still can.
  4. The affected machines can still access other network resources; they just can't access this cluster.
  5. On the SQL Server, the maximum number of concurrent connections is set to 0 (unlimited) and the timeout is set to 10 minutes.
  6. We haven't found anything consistent on the application machines. The issue shows up randomly on any of them, but only ever affects 1 or 2 machines at a time, and it can take hours or days to reappear.
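
To pin down exactly when an outage starts and ends on a given machine, the telnet check from item 3 could be automated and timestamped. The following is a minimal sketch, not something we run today: the function names, the log file name, and the 30-second interval are all hypothetical, and the port has to be supplied by the caller (1433 is only the SQL Server default, our instance may differ).

```python
import socket
import time
from datetime import datetime


def tcp_probe(host, port, timeout=5.0):
    """Attempt a single TCP connection and return (ok, detail)."""
    start = time.monotonic()
    try:
        # Same idea as "telnet host port": only the TCP handshake is tested.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000
            return True, f"connected in {elapsed_ms:.0f} ms"
    except OSError as exc:
        # Covers timeouts, refused connections, and name resolution failures.
        return False, f"{type(exc).__name__}: {exc}"


def log_line(host, port, ok, detail):
    """Format one timestamped result line."""
    stamp = datetime.now().isoformat(timespec="seconds")
    status = "OK" if ok else "FAIL"
    return f"{stamp} {host}:{port} {status} {detail}"


def monitor(host, port, interval=30.0, rounds=None, logfile="sql_probe.log"):
    """Probe repeatedly and append results to a log file.

    rounds=None keeps probing until the process is stopped.
    """
    done = 0
    while rounds is None or done < rounds:
        ok, detail = tcp_probe(host, port)
        with open(logfile, "a", encoding="utf-8") as fh:
            fh.write(log_line(host, port, ok, detail) + "\n")
        done += 1
        time.sleep(interval)
```

Running `monitor(cluster_name, sql_port)` on each application machine, once against the cluster name and once against the cluster IP, would give timestamps that can be lined up with switch and cluster logs. It only shows that the TCP handshake fails, not why.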

At this point we have no idea what's going on, and we are looking for any ideas that could fix this.

Patrick
  • "We can't telnet to the SQL port, while other machines still can" is a major clue. When it happens, try running the stored procedure sp_who. Even with max concurrent connections set to 0, there is a limit of 32,767. However, your point 2 ("can't ping the cluster") points to an intermittent network issue. – Sum1sAdmin Dec 20 '17 at 17:34
  • Yup. Ping and an attempted TCP open connection (telnet to SQL port) both failing makes it look like a network problem. Not necessarily "the network", although that might be it. It could also be the network stack on either the client or the SQL server, which goes from the OS to the physical layer. It's unlikely to be a SQL problem as such. – mfinni Dec 20 '17 at 17:44
  • @Sum1sAdmin sp_who brings back 909 rows. I'm assuming there isn't a limit on how many rows it returns, so that should be everything. – Patrick Dec 20 '17 at 19:25
  • @mfinni You still think it's a network issue when the machine can reach other systems just fine? That is one of the major reasons I was leaning against the network, since I figured a network issue would affect all connections that machine tries to make. – Patrick Dec 20 '17 at 19:28
  • If other, unaffected machines can ping the server while the affected client cannot, then you have ruled out high network I/O and latency on the server, and it's not max concurrent connections etc. I would start looking at tcpdump/Wireshark. The error itself reads "network-related". BTW, what sort of cluster is it? – Sum1sAdmin Dec 21 '17 at 10:50
  • @Sum1sAdmin This is a Windows Server 2008 R2 failover cluster. I will start running Wireshark on this machine to see what is happening. – Patrick Dec 21 '17 at 13:29
  • @Patrick Yes. As I said, it doesn't mean that your network switch(es) is/are failing globally. You could have a bad physical interface and be dropping packets. You could be exhausting TCP ephemeral ports in the client or server OS, or running out of buffers on the switch port or backplane. Wireshark won't find all of those problems unfortunately, you'll also need to inspect the switch and ports. – mfinni Dec 21 '17 at 19:57
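
One of the possibilities raised above, ephemeral port exhaustion, can be checked roughly from the OS side by counting TCP connections per state while the problem is occurring. This is a minimal sketch, assuming the Windows `netstat -ano` column layout (Proto, Local Address, Foreign Address, State, PID); the function names are hypothetical.

```python
import subprocess
from collections import Counter


def count_tcp_states(netstat_text):
    """Count TCP connections per state from Windows `netstat -ano` text."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows: Proto, Local Address, Foreign Address, State, PID.
        # UDP rows and the header line are skipped by the protocol check.
        if len(fields) >= 4 and fields[0].upper() == "TCP":
            states[fields[3]] += 1
    return states


def current_tcp_states():
    """Run netstat on this machine and summarise its TCP states."""
    result = subprocess.run(
        ["netstat", "-ano"], capture_output=True, text=True, check=True
    )
    return count_tcp_states(result.stdout)
```

A very large TIME_WAIT or ESTABLISHED count on an affected client, compared with a healthy one, would make port exhaustion worth a closer look. It says nothing about switch buffers or a bad physical interface, which still need to be inspected on the network hardware itself.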
