1

I've setup an Windows Server Failover Cluster in a test environmnet, and installed a clustered SQL Server 2014 instance on it. The cluster has two nodes, and the network name is SQLINSTANCE.

Now from small console application i have a simple loop (which runs fine) that queries a test database every 100ms:

while (true)
{
   using (var dbContext = new TestDBContext())
   {
       foreach (var person in dbContext.People)
       {
            Console.WriteLine(person.FirstName + " " + person.LastName);
            System.Threading.Thread.Sleep(100);
       }
    } 
 }

Now i force a failover on the cluster (either by shutting down one node, or moving the SQL server role to the other node from Cluster Manager). The failover process takes about 50 seconds. When failover is comlete, my test-loop throws an EntityException, saying "the underlying provider failed on open". The InnerException is a SqlClient.SqlException, saying "Cannot open database "TestDB" requested by the login. The login failed. Login failed for user 'Domain\User'."

I've setup my connection string like this:

 <add name="TestDBEntities" connectionString="metadata=res://*/Model1.csdl|res://*/Model1.ssdl|res://*/Model1.msl;provider=System.Data.SqlClient;
         provider connection string=&quot;data source=sqlinstance;initial catalog=TestDB;integrated security=True;
         MultiSubnetFailover=true;
         connect timeout=150;
         ConnectRetryCount=15;
         ConnectRetryInterval=10;
         MultipleActiveResultSets=True;App=EntityFramework&quot;" providerName="System.Data.EntityClient" />

What i was expecting is that the SqlClient tries to reconnect to my clusterd instance for 15 times, with 10 secondes interval, and then successfull reconnects when the failover is complete. But it's still keeps throwing exceptions.

I tried to change the "Connect Timeout", "ConnectRetryCount" and "ConnectRetryInterval" values, but still there's no success full reconnect after failover...

I'm very confused, what am i missing here, or can i not rely on the underlying SqlClient and do i have to provide my owm retry logic in the DAL for example???

Jeroen1984
  • 1,616
  • 1
  • 19
  • 32
  • And the exception are? I'm asking because probably failover is complete, but not all database are online... – user_0 Dec 18 '14 at 12:40
  • It's an EntityException, saying "the underlying provider failed on open". The InnerException is a SqlClient.SqlException, saying "Cannot open database "TestDB" requested by the login. The login failed. Login failed for user 'Domain\User'." It gets throwed the moment failover is complete. Btw i also eddited my quiestion with this detailed exception information... – Jeroen1984 Dec 18 '14 at 13:06
  • The problem is the cannot opend database. Why this appens? Everytime you restart SQL Server, it need to do some check on database. Depending from size and pending transactions, it can take from seconds to days. You can verify this in error log. You will find messages like "Starting up database 'dbname'" and others giving you informations on status of check. – user_0 Dec 18 '14 at 13:28
  • @user_0, Ok that sounds reasonable. So the instance is up, my app can establish a connection, but the database is not available yet... Do you know any solutions for this, maybe built in the .NET Framework? I read something about EF6's Execution Strategies and the SqlAzureExecutionStrategy. Do you know if this is suitable, or do i have to provide my own logic? – Jeroen1984 Dec 18 '14 at 13:47
  • Not sure about this, sorry. I have some services in different languages. They usually retray connections every x time (x variable based on kind of service or application) – user_0 Dec 18 '14 at 13:50

1 Answers1

1

I am not an expert on all of this, especially, I am not sure how C# handles the db connection. What I have seen on one of our applications was that you open a connection with the db in the AlwaysOn cluster and when you fail over the application the connection is still using the same underlying TCP connection that is actually going to the node which by now has changed. For us the solution was to properly reconnect on catching an exception.

ikkjo
  • 735
  • 1
  • 9
  • 18
  • Hi @ikkjo. Came across this answer of yours. I think we are facing similar situation in our environment, despite failover the application is still connecting to what now is the secondary node. Can you elaborate a bit on what the solution was for you? – Crabster Oct 09 '15 at 08:41
  • Hi @Crabster. Not sure I can give you much more. The application already checked the health of the database connection and tried reconnecting on dropping the connection. However, when trying to commit a change to the database we got an exception. Ultimately it came down to the following (not 100% sure about this though): A transport-level error has occurred when sending the request to the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.) So we caught the exception and re-opened the database connection. – ikkjo Oct 10 '15 at 09:04