4

We are in the process of migrating a stateless services based on Identity Server 3 (https://github.com/IdentityServer/IdentityServer3) to service fabric.

The project runs fine on a local development box and on a production cluster under normal development load but when put into production at about 20-30 requests per second it quickly stops responding to requests and the configured health checks in ARR go Unhealthy.

The service is fronted by a IIS ARR (Application Request Routing) cluster that does SSL off-loading.

The identity server logs output the following two errors, one seemingly related to inbound requests and another related to outbound requests to azure storage used for identity server persistence.

Microsoft.WindowsAzure.Storage.StorageException: The client could not finish the operation within specified timeout. ---> System.TimeoutException: The client could not finish the operation within specified timeout.
   --- End of inner exception stack trace ---
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Collabco.Myday.Identity.IdSvr.BaseStore`1.<ExecuteQueryAsync>d__21.MoveNext() in C:\Dev\myday-identity\IdentityServer\IdSvr\BaseStore.cs:line 258
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
   at Collabco.Myday.Identity.IdSvr.ScopeStore.<GetScopesAsync>d__3.MoveNext() in C:\Dev\myday-identity\IdentityServer\IdSvr\ScopeStore.cs:line 43
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at IdentityServer3.Core.Endpoints.DiscoveryEndpointController.<GetConfiguration>d__11.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Endpoints\Connect\DiscoveryEndpointController.cs:line 73
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Threading.Tasks.System.Web.Http910180.TaskHelpersExtensions.<CastToObject>d__3`1.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Web.Http.Controllers.ApiControllerActionInvoker.<InvokeActionAsyncCore>d__0.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Web.Http.Controllers.ActionFilterResult.<ExecuteAsync>d__2.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Web.Http.Dispatcher.HttpControllerDispatcher.<SendAsync>d__1.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
Request Information
RequestID:
RequestDate:
StatusMessage:

and another exception we see is:

System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.HttpListenerRequest'.
   at System.Net.HttpListenerRequest.CheckDisposed()
   at System.Net.HttpListenerRequest.get_LocalEndPoint()
   at System.Net.HttpListenerRequest.get_IsLocal()
   at Microsoft.Owin.Host.HttpListener.RequestProcessing.OwinHttpListenerContext.GetServerIsLocal()
   at Microsoft.Owin.Host.HttpListener.RequestProcessing.CallEnvironment.get_ServerIsLocal()
   at Microsoft.Owin.Host.HttpListener.RequestProcessing.CallEnvironment.PropertiesTryGetValue(String key, Object& value)
   at Microsoft.Owin.Host.HttpListener.RequestProcessing.CallEnvironment.TryGetValue(String key, Object& value)
   at Microsoft.Owin.OwinContext.Get[T](String key) in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
   at System.Web.Http.Owin.OwinHttpRequestContext.get_IsLocal() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
   at System.Web.Http.Owin.OwinHttpRequestContext.get_IncludeErrorDetail() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
   at System.Net.Http.System.Web.Http910180.HttpRequestMessageExtensions.CreateErrorResponse(HttpRequestMessage request, HttpStatusCode statusCode, Func`2 errorCreator) in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
   at System.Web.Http.ExceptionHandling.DefaultExceptionHandler.Handle(ExceptionHandlerContext context) in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
   at System.Web.Http.ExceptionHandling.DefaultExceptionHandler.HandleAsync(ExceptionHandlerContext context, CancellationToken cancellationToken) in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
   at System.Web.Http.ExceptionHandling.LastChanceExceptionHandler.HandleAsync(ExceptionHandlerContext context, CancellationToken cancellationToken) in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
   at System.Web.Http.ExceptionHandling.ExceptionHandlerExtensions.<HandleAsyncCore>d__0.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Web.Http.Dispatcher.HttpControllerDispatcher.<SendAsync>d__1.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Web.Http.Owin.PassiveAuthenticationMessageHandler.<SendAsync>d__0.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Web.Http.HttpServer.<SendAsync>d__0.MoveNext() in c:\local\identity\server3\IdentityServer3\source\Core\Configuration\DiscoveryOptions.cs:line 0

After approximately 5-10 minutes passed after removing load without recycling anything the service comes back to life. Service fabric detects no failures also during the unstable period.

Anyone any ideas?

David Bown
  • 61
  • 3
  • `Microsoft.WindowsAzure.Storage.StorageException: The client could not finish the operation within specified timeout. ---> System.TimeoutException: The client could not finish the operation within specified timeout.`. Looks like your Azure Storage is pretty slow on handling requests. By the way, the default retry policy for service fabric proxy if calling code throws TimeoutException is to retry the call 10-11 times. Do you handle exceptions from Azure Storage? How? – cassandrad Oct 14 '16 at 14:00
  • @cassandrad My question is why is this happening in the first place because i have exactly the same code, bar a few owin self-host/service fabric specifics, running in production on azure web apps with no storage timeouts logged. I will look into exception handling, it maybe this particular one isn't being handled. I am using the built-in retry support in the azure storage client by the way. I wonder if there is something specific to service fabric i need to configure regarding http requests or connection service points? – David Bown Oct 14 '16 at 14:52
  • I would look onto two things: from which nodes calls are coming, I mean if you have a stateless API gateway, then how many real calls to that Gateway has happened compared to calls to your service that stores to DB performed from that gateway. And is there any exceptions in the services that stores data and if there are, then which one. Also I would measure average time that a single store-to-azure operation takes, literally one call of Insert method. There is nothing specific that needs to be configured. – cassandrad Oct 17 '16 at 11:14

1 Answers1

2

The solution to this was to:

  1. Ensure that timeout exceptions were being caught and handled and logged, in some cases they were not.
  2. Tweaked the retry policy to make retries closer together (100ms delay) and limited to no more than 10 seconds. Also implemented logging of retries. In reality i'm getting no retries so far.
  3. Increased the connection limit on the table service endpoint to 1000.

The most significant factor appears to be number 3 although some improvement was noted by implementing 1 which i did first which is kind of obvious really.

To do this I use a static constructor on my repository class (base store class in identity server terms) has the following code which depends on a StorageAccount instance. All but the last line already existed.

        var tableServicePoint = System.Net.ServicePointManager.FindServicePoint(storageAccount.TableEndpoint);
        tableServicePoint.UseNagleAlgorithm = false;
        tableServicePoint.Expect100Continue = false;
        tableServicePoint.ConnectionLimit = 1000;

The following article helped with this: https://azure.microsoft.com/en-gb/documentation/articles/storage-performance-checklist/

In summary my conclusion is the default connection limit (2 or 10 different documentation conflicts) was causing request to table storage to queue up and eventually timeout and ultimately cause the service to fall over.

To confirm setting the connection limit wasn't necessary when using azure web apps so there is something different about the way service fabric / owin self-host projects work in terms of their use of connections that affects azure storage.

David Bown
  • 61
  • 3