My company has designed an application for our in-house processes that runs on about 50 virtual machines. It has been running for over 5 years, and at the beginning of the year we set up a new server cluster for our new Microsoft SQL Server 2014 database. Things ran well for about 9 months, but for the last 3 months we have been experiencing a very strange issue.
Randomly, one or two of the 50 machines will start seeing the following error:
Unhandled Error: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The wait operation timed out.)
The processes then expire, and usually about 30 to 60 minutes later the machine is able to connect to the server again as if nothing had happened.
- Rebooting the affected machines does not fix the issue; we have to wait until it goes away on its own.
- During this time, the affected machines cannot ping the cluster name or the cluster IP, while other machines still can.
- The affected machines also can't telnet to the SQL port, while other machines still can.
- The affected machines can still access other network resources; they just can't reach this cluster.
- In SQL Server, the maximum number of concurrent connections is set to 0 (unlimited) and the timeout is set to 10 minutes.
- We haven't found anything consistent on the application machines: the issue shows up randomly on any of them, but it only ever affects 1 or 2 machines at a time, and it can take hours or days to reappear.
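To pin down exactly when a machine loses and regains access, the telnet check described above could be automated with something like the following. This is only a rough sketch, not something we run today: the function names, the log file name `sql_probe.log`, and the one-minute interval are all made up for illustration, and it only tests whether a TCP connection to the SQL port can be opened.

```python
import socket
import time
from datetime import datetime


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection to host:port.

    Returns (ok, detail), where detail describes the failure if any.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:  # covers timeouts, refusals, and DNS failures
        return False, f"{type(exc).__name__}: {exc}"


def monitor(host, port, attempts, interval=60.0, log_path="sql_probe.log"):
    """Probe host:port `attempts` times, appending one timestamped line per probe.

    log_path and interval are illustrative defaults, not values from our setup.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        for i in range(attempts):
            ok, detail = check_tcp(host, port)
            stamp = datetime.now().isoformat(timespec="seconds")
            status = "OK" if ok else "FAIL"
            log.write(f"{stamp} {host}:{port} {status} {detail}\n")
            log.flush()  # keep the log current even if the script is killed
            if i < attempts - 1:
                time.sleep(interval)
```

Running, for example, `monitor("sql-cluster.example.local", 1433, attempts=1440)` on every application machine (the host name is a placeholder, and 1433 assumes the default SQL Server port) would give timestamped start and end times for each outage, which could then be compared against cluster, switch, or firewall logs.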
At this point we have no idea what's going on, and we are looking for any ideas that could fix this.