9

We are working on a Service Fabric application made up of several different services, and a key part of the way our application works is that these services need to call each other in high volumes.

We had no problems until recently, when we increased the load on our application and found it massively slowed down. After much investigation and timing various things, we found that when we make a lot of calls to one type of service (of which we have several instances), there seems to be a delay between us calling the service and the service actually beginning to process the request.

We are calling between services as described by Microsoft here.

To be clearer: ServiceA gets a reference to ServiceB and then calls ServiceB.GetResult(). We log the time that this call is made in ServiceA, and the first thing we do in GetResult() is log the time that processing begins. Under no load the gap is only a few ms; once we increased the load we found a 4-5 second delay between these times.
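
Roughly, the measurement looks like the sketch below. The names here (IServiceB, the string return type, the URI and the ServiceEventSource logging) are illustrative placeholders rather than our exact code:

    // Remoting contract shared by ServiceA and ServiceB (illustrative).
    public interface IServiceB : IService
    {
        Task<string> GetResult();
    }

    // ServiceA side: resolve a proxy to ServiceB and log the time just before the remote call.
    IServiceB serviceB = ServiceProxy.Create<IServiceB>(new Uri("fabric:/MyApp/ServiceB"));
    ServiceEventSource.Current.Message("Calling GetResult at {0:O}", DateTime.UtcNow);
    string result = await serviceB.GetResult();

    // ServiceB side: the first statement in GetResult logs when processing actually begins.
    public Task<string> GetResult()
    {
        ServiceEventSource.Current.Message("GetResult started at {0:O}", DateTime.UtcNow);
        return Task.FromResult(DoTheActualWork()); // DoTheActualWork is a placeholder for our real logic
    }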

Is this some kind of limit in Service Fabric? We have multiple instances of ServiceB, and the resource usage on the cluster is essentially nothing: CPU hovers around 10% and memory use is about 1/4 on all nodes, but the throughput of the service is very low because it is waiting here.

Why does it wait? Is there some kind of defined limit on how many calls a service can handle at a time? Have we done something wrong with our communication?

Thank you.

QTom
  • How much load are we talking about here when it starts to slow down? Is it possible to quantify that in terms of calls/sec or similar? Also, if you look at the logging produced, how long do you see between the service method start and stop events? Does that include the 4-5 second delay, or does it only reflect the time that service method would usually take? – yoape Feb 02 '17 at 16:24
  • @yoape in terms of calls to ServiceB, the increased load was around 60 per second, and we were logging the time taken to complete the GetResult method and the average was around 500ms. The 4-5 second wait seemed to be outside of our code. – QTom Feb 02 '17 at 16:29
  • Can you see whether any timeout exceptions are being thrown from the service, forcing the clients to retry? That would explain the delay: the default back-off time is 2 seconds, and if messages were retried 2-3 times on average that would mean 4-5 seconds that you would not see in the actual execution of the service method; it is basically time the client spends waiting before trying again. The ``FabricTransportServiceRemotingClient`` has built-in retry functionality that looks at the ``OperationRetrySettings`` for max retry count and back-off delay. – yoape Feb 02 '17 at 16:50
  • I could see similar issues when sending large amounts of messages in parallel to a service: at a certain point the service started getting backed up handling the requests, and the clients had to handle timeout exceptions, which they retried. Look at the charts in http://stackoverflow.com/a/41793846/1062217 - while not a comprehensive test, it shows that this happens at higher frequencies of communication. – yoape Feb 02 '17 at 16:52
  • Could you try changing the retry count for exceptions? When you create your ServiceProxyFactory, inject new values into the OperationRetrySettings: ``_serviceProxyFactory = new ServiceProxyFactory(retrySettings: new OperationRetrySettings(TimeSpan.FromMilliseconds(3), TimeSpan.FromMilliseconds(3), 0));`` (a formatted sketch of this follows the comments below). This will prevent SF clients from retrying (max retry count = 0). You should now see a lot of exceptions from clients, but an average execution time for the ones that are successfully handled. That is, if my theory is right. But it's an easy test, and if not, we have eliminated that... – yoape Feb 02 '17 at 17:15
  • @yoape thanks, after looking into the ServiceProxyFactory I found some settings that helped me prevent this queuing – QTom Feb 03 '17 at 14:38
  • good that you found the solution to your problem, that part is really not very well documented – yoape Feb 03 '17 at 17:37
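
For reference, the retry-disabling test suggested in the comments looks roughly like this when formatted. The zero max-retry count is the important part; the 3 ms back-off values are just placeholders, and IServiceB and the URI are the same illustrative names as in the sketch above:

    // Diagnostic only: disable client-side retries so timeouts surface as exceptions
    // instead of being hidden behind the default 2 second back-off delay.
    ServiceProxyFactory serviceProxyFactory = new ServiceProxyFactory(
        retrySettings: new OperationRetrySettings(
            TimeSpan.FromMilliseconds(3),   // max back-off on transient errors
            TimeSpan.FromMilliseconds(3),   // max back-off on non-transient errors
            0));                            // max retry count = 0, i.e. never retry

    IServiceB serviceB = serviceProxyFactory.CreateServiceProxy<IServiceB>(
        new Uri("fabric:/MyApp/ServiceB"));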

1 Answer

8

The MaxConcurrentCalls setting seemed to be what I needed.

When connecting to the service:

    // Allow the remoting client to have more calls in flight at once.
    FabricTransportSettings transportSettings = new FabricTransportSettings
    {
        MaxConcurrentCalls = 32
    };

    // Create the proxy factory with a client factory that uses these transport settings.
    ServiceProxyFactory serviceProxyFactory = new ServiceProxyFactory(
        (c) => new FabricTransportServiceRemotingClientFactory(transportSettings));

    service = serviceProxyFactory.CreateServiceProxy<T>(serviceUri);

Creating service listeners:

    protected override IEnumerable<ServiceInstanceListener> CreateServiceInstanceListeners()
    {
        // Allow the listener to process more calls concurrently on the service side.
        FabricTransportListenerSettings listenerSettings = new FabricTransportListenerSettings
        {
            MaxConcurrentCalls = 32
        };

        return new[]
        {
            new ServiceInstanceListener(
                (context) => new FabricTransportServiceRemotingListener(context, this, listenerSettings))
        };
    }
QTom
  • 3
    You can also specify those settings for FabricTransport in the settings.xml for the service that you are connecting to (see the sketch below). The FabricTransportServiceRemotingListener will pick that up when it creates the listener _and_ when it creates the client factory. – yoape Feb 03 '17 at 17:36
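
For anyone finding this later: the Settings.xml approach mentioned in the comment above would look roughly like the fragment below, placed in the config package of the service being called. ``TransportSettings`` is the default section name the remoting transport looks for; verify the exact section and parameter names against the SDK version you are using.

    <!-- PackageRoot\Config\Settings.xml of the called service (illustrative) -->
    <Settings xmlns="http://schemas.microsoft.com/2011/01/fabric">
      <Section Name="TransportSettings">
        <Parameter Name="MaxConcurrentCalls" Value="32" />
      </Section>
    </Settings>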