4

I'm running a load test on my system. At a certain level of load, I start getting SQL errors in my log:

System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server) ---> System.ComponentModel.Win32Exception (0x80004005): The network path was not found

By running performance monitor on the SQL server in question, I found the following:

  • CPU level rarely exceeds 50%. (On a previous iteration I saw that it was maxing out at 100%, so I increased the specs of the VM, which helped push the problem to a higher load level.)
  • The number of user connections peaked at a shade over 8,000. SQL Server has the default setting of 32,767 max connections.
  • The connection string specifies a max pool size of 1000 connections to each database, and there are 100 databases on the server. The load test is randomly distributed between the 100 databases, so there should be a fairly even distribution, meaning about 80 connections per database. Nowhere near the 1k limit.

What other factors might cause SQL Server to stop being able to accept connections?

UPDATE: extra info: I'm using Entity Framework Core (EF7) for my DB connections, if that helps any.

user207421
Shaul Behr
  • Network settings? Did you verify in the SQL Server network settings that TCP/IP is enabled and that the IPs are enabled too? https://technet.microsoft.com/en-us/library/hh231672(v=sql.110).aspx – NicoRiff Jan 05 '17 at 14:45
  • @NicoRiff The problem only shows up at high loads. It works otherwise, so it can't be network settings, unless network settings have some kind of throttling set. – Shaul Behr Jan 05 '17 at 14:52
  • @ShaulBehr, any related errors in the SQL Server error log? How many client machines? Do any of the individual CPU cores on the DB server consistently show much higher utilization than the others? – Dan Guzman Jan 05 '17 at 15:01
  • @DanGuzman Nothing in the SQL error log. 42 client machines. How do I find out about the individual CPU cores? – Shaul Behr Jan 05 '17 at 15:07
  • Oh, NM, obviously, using Resource Monitor... well, I guess I'll have to rerun my tests to find out. What would it mean if a couple of CPUs are being overloaded? – Shaul Behr Jan 05 '17 at 15:08
  • @DanGuzman and the answer is no; the CPUs on the SQL server are pretty much sharing the load. – Shaul Behr Jan 05 '17 at 15:38
  • Do you want to solve the problem or get an explanation? To solve the problem connect using TCP/IP instead of Named Pipes. Explanation: Who knows? – Ben Jan 09 '17 at 12:43
  • @Ben I want to solve the problem. Please can you post an answer showing explicitly what you are proposing? – Shaul Behr Jan 09 '17 at 12:47
  • Also have you checked `sp_who` to see how many current connections there are? Possibly you are leaking connections and exceeding the 32,000 limit – Ben Jan 09 '17 at 12:48
  • @Ben I am running `select d.name, count(1) from sysprocesses p join sysdatabases d on p.dbid = d.dbid group by d.name order by count(1) desc` . This is how I know how many connections I have to any one database at any given time. And Performance Monitor to tell me the total. – Shaul Behr Jan 09 '17 at 12:49
  • Maybe your test is seen as a network attack (syn, ddos, etc.)? the error seems to indicate the problem is lower than SQL server (named pipe/network). Or may be simply some max named pipe connections reached. – Simon Mourier Jan 10 '17 at 07:24
  • @SimonMourier can you please flesh that out a bit? Any idea how I could verify that, or try a different type of connection? – Shaul Behr Jan 10 '17 at 07:37
  • what does your connection string look like (w/o passwords :-)? – Simon Mourier Jan 10 '17 at 13:28
  • @SimonMourier Like this: `Data Source=myazureserver.cloudapp.azure.com;Initial Catalog=MyDb;User ID=MyUser;Password=MyPwd;Max Pool Size=1000;MultipleActiveResultSets=True;Application Name="My App"` – Shaul Behr Jan 10 '17 at 14:52
  • perhaps related to http://stackoverflow.com/questions/33857505/azure-sql-database-sometimes-unreachable-from-azure-websites – chue x Jan 10 '17 at 21:08
  • Along the lines of the comment by @Ben above, you can slightly simplify the network aspect of troubleshooting by only enabling TCP/IP (and disabling named pipes). The issue could be arising in one of many layers. It could be something in Azure that does throttling (nothing in Azure logs?). It could be something peculiar to SQL Server on Azure. You might never find out, so you may have to simplify the configuration to help with troubleshooting. Regardless, an internet-facing application server should have as small a surface area as possible – Nick.Mc Jan 16 '17 at 03:36

4 Answers

8

"Network Path Not Found" does not seem like an error related to SQL Server's capacity. As a former "IT Guy," I suspect that a firewall is dropping your packets. If this is during a stress test, the firewall could be interpreting the numerous requests as a denial of service attack, and using some sort of predefined rule to drop connections for a specified time period.

What is your network environment? If you have a hardware firewall or router with IPS capabilities, I would check those logs to see if you find a smoking gun. You might have to create a special rule to allow unlimited traffic to your SQL Server.

Dave Smash
  • Edited question: I'm using EF Core – Shaul Behr Jan 11 '17 at 12:50
  • If it is a firewall issue, then it wouldn't matter what style of coding you are using. If this is an Azure VM, I would try disabling its Windows Firewall next time you reproduce the problem and see if it starts working again. If you have access to another computer outside of your physical location, you can try it from that computer when you reproduce the problem to eliminate firewall/network issues on your computer or location. If that doesn't point you in the right direction, I would use WireShark to do a packet capture. Look at traffic on port 1433 and see if you get a useful error msg. – Dave Smash Jan 11 '17 at 17:23
  • Just be aware that named pipes is on port 445, not port 1433 – Nick.Mc Jan 16 '17 at 03:37
  • Don't have a conclusive answer, but I'll give you the bounty on the basis of most upvotes thus far – Shaul Behr Jan 16 '17 at 16:19
  • @ShaulBehr It's your bonus, but I save mine for the answer that fixes my problem. It's not a popularity contest. – user207421 Jan 17 '17 at 00:17
  • @EJP except that my week was up, so I had to award it to *someone* :-) – Shaul Behr Jan 17 '17 at 09:58
  • Well, for what it's worth, I appreciate it. Feel free to reach out if I can be of any assistance in tracking down the issue. – Dave Smash Jan 17 '17 at 14:59
  • Well, we thought the error had gone away, but it hasn't. And now it's posing a practically existential threat to our organization. Have you any other ideas? – Shaul Behr Oct 24 '17 at 12:13
  • When you posted initially, I got the impression that you didn't have firewall access or weren't able to fully investigate network-related causes - this still doesn't sound like a C# or SQL Server issue, but an environmental one. If this is an existential threat, how about migrating (temporarily or permanently) to a cloud-based server on Azure or someplace and seeing if that resolves the issue? I think it will, and if it doesn't, you'll have their tech support department to lean on... @ShaulBehr – Dave Smash Oct 24 '17 at 15:49
  • I think we've got it. It's not a problem on SQL Server; it's a problem on the API server. We were spinning up hundreds of parallel threads, which basically brought the API to its knees. – Shaul Behr Oct 24 '17 at 19:00
1

It's a bit curious that you are getting that many connections to the database. You should be utilizing connection pooling; even under intense load, the connection pooling should greatly reduce the number of active connections being used.

Can you provide the code that's accessing the database? Are you calling the `Dispose()` method or closing the connection?

Also, have you looked into whether data caching would ease the DB load? A 2-5 second data cache can greatly reduce database calls.
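To make the short-lived cache idea concrete, here is a minimal sketch in Python (chosen only for brevity; the thread itself is about .NET). Everything in it is an assumption for illustration: the `TtlCache` class, the `get_or_load` method and the 3-second default are hypothetical names and values, not part of EF Core or any library.

```python
import threading
import time


class TtlCache:
    """Tiny time-based cache: a value is reused until its time-to-live expires."""

    def __init__(self, ttl_seconds=3.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries = {}           # key -> (expires_at, value)
        self._lock = threading.Lock()

    def get_or_load(self, key, loader):
        """Return the cached value for key, calling loader() only when it is missing or stale."""
        # Holding the lock while loading keeps the sketch simple, at the cost of
        # serializing loads; a production cache would lock per key instead.
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]
            value = loader()
            self._entries[key] = (now + self._ttl, value)
            return value


if __name__ == "__main__":
    db_hits = []

    def load_settings():
        # Hypothetical stand-in for a real database query.
        db_hits.append(1)
        return {"theme": "dark"}

    cache = TtlCache(ttl_seconds=3.0)
    cache.get_or_load("settings", load_settings)
    cache.get_or_load("settings", load_settings)
    print(f"2 reads caused {len(db_hits)} database hit(s)")
```

Whether a few seconds of staleness is acceptable depends on the data; this only pays off for values that are read far more often than they change.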

Paul Tsai
1

You are running into the TCP listen() backlog limit for the SQL-Server's listening port. When this happens, Windows platforms (but not *nix platforms) will issue 'connection refused' for further incoming connections.

I'm not an SQL-Server guy but there is bound to be a parameter somewhere by which you can increase its listen backlog.

Alternatively you should look into better or more connection pooling at the client.

user207421
  • This is intriguing. I guess if this is happening there should be some kind of message in the windows event log. – Nick.Mc Jan 16 '17 at 03:38
  • Do you have any references for this "TCP listen backlog time limit"? – Shaul Behr Jan 16 '17 at 07:32
  • @ShaulBehr I didn't say anything about a "TCP listen backlog time limit", but the behaviour I described is well known. – user207421 Jan 16 '17 at 10:20
  • I found a few hits on google under "TCP listen backlog limit windows" but first I would go into the windows event log and see if there are any messages to this effect – Nick.Mc Jan 16 '17 at 12:07
  • @Nick.McDermaid which log in particular? I looked under Application, Security and System, and found no error/warning/critical entries that recurred at the rate I was getting errors – Shaul Behr Jan 16 '17 at 12:56
  • I don't even know if it's a real thing, I'm just running with it. I googled and there is lots about it... but it's for Windows XP and it's about open ports, not a queue. http://serverfault.com/questions/51597/how-to-fix-tcp-ip-has-reached-the-security-limit-event-message – Nick.Mc Jan 16 '17 at 13:27
  • @ShaulBehr I would *not* expect to find anything in any Windows log about this. It happens deep inside the kernel. – user207421 Jan 16 '17 at 17:21
  • @Nick.McDermaid Your link has nothing to do with the TCP listen backlog queue whatsoever. – user207421 Jan 16 '17 at 18:17
  • Ok. Looks like this one is a dead end – Nick.Mc Jan 16 '17 at 21:21
  • @Nick.McDermaid You mean *your link* is a dead end? – user207421 Jan 16 '17 at 23:02
  • My link is definitely a dead end but also it doesn't look like any more information is forthcoming in this line of enquiry, if you're not going to offer anything and my random googles can't turn up anything. – Nick.Mc Jan 16 '17 at 23:05
  • @Nick.McDermaid 'Offer anything' such as what? I've already stated that I'm not an SQL-Server guy but what I described is far too well known to TCP programmers to need a reference. – user207421 Jan 16 '17 at 23:29
0

It turns out the problem wasn't on SQL at all. The problem was on our API server, where some of the APIs were spinning off hundreds of parallel threads, each making its own connection to the database. The load was simply too much for the API server, and it started returning "Access Denied" exceptions without even really attempting to connect to the database.

Solution: we throttled the number of threads being spun off, using the pattern shown in this answer.
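As a rough illustration of what throttling means here (not the pattern from the linked answer, which is not reproduced in this post), the sketch below caps how many units of work run at once by using a fixed-size worker pool. It is written in Python for brevity; `run_throttled`, `fake_db_call`, `ConcurrencyProbe` and the cap of 8 are all hypothetical names and values chosen for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 8  # assumed cap; tune against what the API server can sustain


class ConcurrencyProbe:
    """Records the highest number of tasks that were running at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self._active -= 1
        return False


def fake_db_call(item, probe):
    # Hypothetical stand-in for "open a connection, run a query, dispose it".
    with probe:
        time.sleep(0.01)
        return item * 2


def run_throttled(items, max_parallel=MAX_PARALLEL):
    """Process every item, but never more than max_parallel of them at once."""
    probe = ConcurrencyProbe()
    # The pool owns a fixed number of worker threads; extra items wait in its
    # queue instead of each getting a thread (and a connection) of its own.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        results = list(pool.map(lambda item: fake_db_call(item, probe), items))
    return results, probe.peak


if __name__ == "__main__":
    results, peak = run_throttled(range(300))
    print(f"processed {len(results)} items, peak concurrency {peak}")
```

The point of the design is that the number of simultaneous database connections is bounded by the pool size rather than by the number of incoming work items. In .NET the same effect is commonly achieved with `SemaphoreSlim` or a bounded degree of parallelism.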

Shaul Behr