
I'm currently working on a Service Fabric microservice that needs high throughput.

I wondered why I'm not able to achieve more than 500 1KB messages per second on my workstation using loopback.

I removed all the business logic and attached a performance profiler, just to measure end-to-end performance.

It seems that ~96% of the time is spent resolving the client and only ~2% doing the actual HTTP requests.

I'm invoking "Send" in a tight loop for the test:

private HttpCommunicationClientFactory factory = new HttpCommunicationClientFactory();

public async Task Send()
{
    var client = new ServicePartitionClient<HttpCommunicationClient>(
         factory,
         new Uri("fabric:/MyApp/MyService"));

    await client.InvokeWithRetryAsync(c => c.HttpClient.GetAsync(c.Url + "/test"));
}

Profiler output

Any ideas? According to the documentation, the way I call the services seems to be Service Fabric best practice.

UPDATE: Caching the ServicePartitionClient does improve the performance, but with partitioned services I'm unable to cache a client per partition, since I don't know which partition a given PartitionKey maps to.
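
For context, caching per partition-key value would look roughly like this (a sketch only: the ConcurrentDictionary field, the Send(long) signature, and the use of System.Collections.Concurrent are made up for illustration). The catch is that with Int64 range partitioning many distinct key values can map to the same underlying partition, so this still caches per key rather than per partition:

private readonly ConcurrentDictionary<long, ServicePartitionClient<HttpCommunicationClient>> clients =
    new ConcurrentDictionary<long, ServicePartitionClient<HttpCommunicationClient>>();

public Task Send(long partitionKey)
{
    // Reuse one ServicePartitionClient per partition-key value instead of
    // creating a new one on every call.
    var client = clients.GetOrAdd(
        partitionKey,
        key => new ServicePartitionClient<HttpCommunicationClient>(
            factory,
            new Uri("fabric:/MyApp/MyService"),
            new ServicePartitionKey(key)));

    return client.InvokeWithRetryAsync(c => c.HttpClient.GetAsync(c.Url + "/test"));
}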

UPDATE 2: I'm sorry that I didn't include full details in my initial question. We noticed the huge overhead of InvokeWithRetry when initially implementing socket-based communication.

You won't notice it that much if you are using HTTP requests. An HTTP request already takes ~1 ms, so adding 0.5 ms for InvokeWithRetry isn't that noticeable.

But if you use raw sockets, which in our case take ~0.005 ms, adding 0.5 ms of InvokeWithRetry overhead is immense (roughly 100x the call itself)!

Here is an HTTP example; with InvokeWithRetry it takes 3x as long:

public async Task RunTest()
{
    var factory = new HttpCommunicationClientFactory();
    var uri = new Uri("fabric:/MyApp/MyService");
    var count = 10000;

    // Example 1: ~6000ms
    for (var i = 0; i < count; i++)
    {
        var pClient1 = new ServicePartitionClient<HttpCommunicationClient>(factory, uri, new ServicePartitionKey(1));
        await pClient1.InvokeWithRetryAsync(c => c.HttpClient.GetAsync(c.Url));
    }

    // Example 2: ~1800ms
    var pClient2 = new ServicePartitionClient<HttpCommunicationClient>(factory, uri, new ServicePartitionKey(1));
    HttpCommunicationClient resolvedClient = null;
    await pClient2.InvokeWithRetryAsync(
        c =>
        {
            resolvedClient = c;
            return Task.FromResult(true);
        });

    for (var i = 0; i < count; i++)
    {
        await resolvedClient.HttpClient.GetAsync(resolvedClient.Url);
    }
}

I'm aware that InvokeWithRetry adds some nice things I don't want to lose from the clients. But does it need to resolve the partition on every call?
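
For comparison, here is roughly the lighter-weight pattern I have in mind: resolve the partition once with ServicePartitionResolver, reuse the endpoint, and only re-resolve when a call fails. This is only a sketch (it assumes the Microsoft.ServiceFabric.Services.Client, System.Net.Http, and System.Threading namespaces; the endpoint address may need parsing depending on the listener; and it does not handle every failure scenario):

public async Task SendAllViaResolver()
{
    var resolver = ServicePartitionResolver.GetDefault();
    var serviceUri = new Uri("fabric:/MyApp/MyService");
    var partitionKey = new ServicePartitionKey(1);
    var httpClient = new HttpClient();

    // Resolve the partition once up front.
    var resolved = await resolver.ResolveAsync(serviceUri, partitionKey, CancellationToken.None);

    for (var i = 0; i < 10000; i++)
    {
        // GetEndpoint() returns an endpoint of the resolved partition. Depending on
        // the communication listener, Address may be a JSON map of listener names
        // to URLs that needs parsing first.
        var address = resolved.GetEndpoint().Address;

        try
        {
            await httpClient.GetAsync(address + "/test");
        }
        catch (HttpRequestException)
        {
            // Re-resolve only on failure, passing the previous result so Service
            // Fabric refreshes its cached endpoints instead of handing them back again.
            resolved = await resolver.ResolveAsync(resolved, CancellationToken.None);
        }
    }
}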

coalmee

1 Answer


I thought it would be nice to actually benchmark this and see what the difference was. I created a basic setup with a stateful service that opens an HttpListener and a client that calls that service in three different ways:

  • Creating a new client for each call and doing all the calls in sequence

    for (var i = 0; i < count; i++)
    {
        var client = new ServicePartitionClient<HttpCommunicationClient>(_factory, _httpServiceUri, new ServicePartitionKey(1));
        var httpResponseMessage = await client.InvokeWithRetryAsync(c => c.HttpClient.GetAsync(c.Url + $"?index={id}"));
    }
    
  • Creating the client only once and reusing it for each call, in sequence

    var client = new ServicePartitionClient<HttpCommunicationClient>(_factory, _httpServiceUri, new ServicePartitionKey(1));
    for (var i = 0; i < count; i++)
    {
        var httpResponseMessage = await client.InvokeWithRetryAsync(c => c.HttpClient.GetAsync(c.Url + $"?index={id}"));
    }
    
  • Creating a new client for each call and running all the calls in parallel

    var tasks = new List<Task>();
    for (var i = 0; i < count; i++)
    {
        tasks.Add(Task.Run(async () =>
        {
            var client = new ServicePartitionClient<HttpCommunicationClient>(_factory, _httpServiceUri, new ServicePartitionKey(1));
            var httpResponseMessage = await client.InvokeWithRetryAsync(c => c.HttpClient.GetAsync(c.Url + $"?index={id}"));
        }));
    }
    Task.WaitAll(tasks.ToArray());
    

I then ran the test for a number of different call counts to get a form of average:

Per call duration for different client and call models

Now, this should be taken for what it is: not a complete and comprehensive test in a controlled environment. There are a number of factors that will affect this performance, such as the cluster size, what the called service actually does (in this case, nothing really), and the size and complexity of the payload (in this case, a very short string).

In this test I also wanted to see how Fabric Transport behaved, and its performance was similar to HTTP transport (honestly, I had expected it to be slightly better, but that might not be visible in this trivial scenario).

It's worth noting that for the parallel execution of 10,000 calls the performance was significantly degraded. This is likely because the service runs out of working memory. The effect of this might be that some of the client calls fault and are retried (to be verified) after a delay. The way I measure the duration is the total time until all calls have completed. At the same time, it should be noted that the test does not really allow the service to use more than one node, since all the calls are routed to the same partition.

To conclude, the performance effect of reusing the client is nominal, and for trivial calls HTTP performs similarly to Fabric Transport.

yoape
  • Thanks for your detailed tests! They helped me to update my initial question and to pin down the problem in a way that is easier to explain. The problem lies within the behaviour of InvokeWithRetryAsync itself, not in whether the ServicePartitionClient is instantiated each time. – coalmee Jan 22 '17 at 19:24
  • I think you are right about InvokeWithRetryAsync being the main culprit here. Looking at what that code does (using dotPeek or Reflector) shows that it is doing a lot more than just calling the HTTP endpoint. The thing is, if you want your communication to be reliable you may need all of that, or you could potentially implement something lightweight that fits your communication requirements to a tee. Btw, I think the reason why it tries to resolve the partition each time is because the primary might have changed to a replica on another node. – yoape Jan 22 '17 at 21:57
  • I'm afraid I won't get such an implementation working in every failure scenario, and I don't have the time to test it thoroughly. In my case, fetching the endpoints once, or whenever the FabricClient reports a partition change, would be enough. When a client throws an exception because of a communication failure to a node, a refetch would also be reasonable. But fetching every "nanosecond" is too much. – coalmee Jan 23 '17 at 08:34
  • To answer the question about resolving every time, that's not necessary. The client should only be re-resolving if it received some sort of error in communicating. So the pattern is generally: 1) Resolve 2) Connect 3) Communicate. If you receive certain errors or the connection (if you have one) breaks, then Re-Resolve and start over (passing in the prior ResolvedServicePartition, which acts as a hint to invalidate caching in SF along the way while trying to get new endpoints/addresses). Re-resolving every time is unnecessary since if the service hasn't moved you're just hitting the same cache – masnider Jan 23 '17 at 18:48
  • Okay. So this seems to be a "bug" in the ServicePartitionClient? Because InvokeWithRetryAsync is always calling Resolve. And I'm not able to reuse the ServicePartitionClient, since I'm not aware of the partitions. – coalmee Jan 24 '17 at 08:21