
My .NET Core 3.1 app uses Polly 7.1.0 retry and bulkhead policies for HTTP resilience. The retry policy uses HandleTransientHttpError() to catch possible HttpRequestExceptions.

Now HTTP requests fired with MyClient sometimes throw an HttpRequestException. Around half of them are caught and retried by Polly. The other half, however, ends up in my try-catch block and I have to retry them manually. This happens before the maximum number of retries is exhausted.

How did I manage to create a race condition preventing Polly from catching all exceptions? And how can I fix this?

I register the policies with the IHttpClientFactory as follows.

public void ConfigureServices(IServiceCollection services)
{
    services.AddHttpClient<MyClient>(c =>
    {
        c.BaseAddress = new Uri("https://my.base.url.com/");
        c.Timeout = TimeSpan.FromHours(5); // Generous timeout to accommodate retries
    })
        .AddPolicyHandler(GetHttpResiliencePolicy());
}

private static AsyncPolicyWrap<HttpResponseMessage> GetHttpResiliencePolicy()
{
    var delay = Backoff.DecorrelatedJitterBackoffV2(medianFirstRetryDelay: TimeSpan.FromSeconds(1), retryCount: 5);

    var retryPolicy = HttpPolicyExtensions
            .HandleTransientHttpError() // This should catch HttpRequestException
            .OrResult(msg => msg.StatusCode == HttpStatusCode.NotFound)
            .WaitAndRetryAsync(
                sleepDurations: delay,
                onRetry: (response, timespan, retryCount, context) => LogRetry(response, retryCount, context));

    var throttlePolicy = Policy.BulkheadAsync<HttpResponseMessage>(maxParallelization: 50, maxQueuingActions: int.MaxValue);

    return Policy.WrapAsync(retryPolicy, throttlePolicy);
}

MyClient, which fires the HTTP requests, looks as follows.

public async Task<TOut> PostAsync<TOut>(Uri requestUri, string jsonString)
{
    try
    {
        using (var content = new StringContent(jsonString, Encoding.UTF8, "application/json"))
        using (var response = await httpClient.PostAsync(requestUri, content)) // This throws HttpRequestException
        {
            // Handle response
        }
    }
    catch (HttpRequestException ex)
    {
        // This should never be hit, but unfortunately is
    }
}

Here is some additional information, although I'm not sure that it's relevant.

  1. Since the HttpClient is DI-registered transiently, there are 10 instances of it flying around per unit of work.
  2. Per unit of work, the client fires ~400 http requests.
  3. The HTTP requests are lengthy (5 min duration, 30 MB response)
  • I don't see any obvious errors in your Polly setup. Can you create a minimal repro, ideally using a local service? I'd say off the top of my head the problem is *probably* not Polly, which is one of the most well-tested libraries I've ever seen. – Stephen Cleary Dec 11 '20 at 22:21
  • @MatthiasSchuchardt Why do you have a trigger for `NotFound`? Are you trying to deal with eventual consistency? – Peter Csala Dec 12 '20 at 07:27
  • @PeterCsala the webservice I am consuming infrequently returns a 404 under back pressure. The same requests usually return a 200 on retry or when the webservice is idling. Since I can't differentiate between the "real" and the load-based 404s I unfortunately need to retry the "real" 404s 5 times as well. – Matthias Schuchardt Dec 12 '20 at 08:20
  • @MatthiasSchuchardt If 404 is used as a back-pressure signal (which is kinda weird by the way ... normally 429 is used for that), then why don't you use a Circuit Breaker for this scenario to avoid overloading the downstream system? – Peter Csala Dec 12 '20 at 08:26
  • @PeterCsala agreed, the webservice should use 429, but unfortunately this is out of my control. And also agreed that I could use a circuit breaker to give it some space to breathe (although a similar effect should be achieved by the exponential backoff in `DecorrelatedJitterBackoffV2`). However, the point of the question is to determine what could possibly make Polly miss an exception it is supposed to catch. Evaluating this might also benefit others. – Matthias Schuchardt Dec 14 '20 at 21:27
  • @MatthiasSchuchardt I've left an answer, please check it. Regarding CB vs backoff: they are not the same. In case of back-off, all forthcoming requests are **allowed** to be issued against the target server. In case of CB, all forthcoming requests are **blocked** until there is a successful request while the CB is in the `HalfOpen` state. – Peter Csala Dec 15 '20 at 09:35
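
For reference, here is a hedged sketch of the circuit breaker idea discussed in the comments above, slotted into the question's existing policy wrap (the thresholds are illustrative only, not a recommendation):

var circuitBreakerPolicy = HttpPolicyExtensions
    .HandleTransientHttpError()
    .OrResult(msg => msg.StatusCode == HttpStatusCode.NotFound)
    .CircuitBreakerAsync(
        handledEventsAllowedBeforeBreaking: 5,
        durationOfBreak: TimeSpan.FromSeconds(30));

// Outermost to innermost: retry -> circuit breaker -> bulkhead.
return Policy.WrapAsync(retryPolicy, circuitBreakerPolicy, throttlePolicy);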

1 Answer


Retry and the HttpRequestException

Whenever we are talking about Polly policies, we can distinguish between two kinds of exceptions (a short sketch illustrating the difference follows the lists below):

  • handled
  • unhandled.

Handled exception

  • It triggers the given policy's behaviour (in this case the HttpRequestException).
  • The handled exception is re-thrown if the policy can't succeed.
  • If there is another policy, it may or may not handle that exception.

Unhandled exception

  • It does not trigger any reaction from the policy (for example a WebException in our case).
  • The unhandled exception flows through the policy untouched.
  • If there is another policy, it may or may not handle that exception.
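
To make this concrete, here is a minimal, self-contained sketch (the exception types and retry count are illustrative and not taken from the question's setup): the policy handles HttpRequestException, while an InvalidOperationException flows through it untouched.

var policy = Policy
    .Handle<HttpRequestException>()
    .RetryAsync(3);

try
{
    // Unhandled: the policy does not react, the exception reaches the caller immediately.
    await policy.ExecuteAsync(() => throw new InvalidOperationException("unhandled"));
}
catch (InvalidOperationException)
{
    // Hit right away, without a single retry attempt.
}

try
{
    // Handled: the policy retries 3 times, then re-throws the exception.
    await policy.ExecuteAsync(() => throw new HttpRequestException("handled"));
}
catch (HttpRequestException)
{
    // Hit only after all 3 retries have been exhausted.
}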

"Around half of them are caught and retried by Polly.
The other half however ends up in my try-catch-block"

This can happen if some of your retries run out of attempts. In other words, there are some requests which could not succeed within 6 attempts (5 retries and 1 initial attempt).

This can be easily verified with one of the following two tools:

  • onRetry + context
  • Fallback + context

onRetry + context

The onRetry delegate is called when the retry policy is triggered, but before the sleep duration starts. The delegate receives the retryCount. To relate separate log entries of the same request, you need to use some sort of correlation id. The simplest way to have one can be coded like this:

public static class ContextExtensions
{
    private const string Key = "CorrelationId";

    public static Context SetCorrelation(this Context context, Guid? id = null)
    {
        context[Key] = id ?? Guid.NewGuid();
        return context;
    }

    public static Guid? GetCorrelation(this Context context)
    {
        if (!context.TryGetValue(Key, out var id))
            return null;

        if (id is Guid correlation)
            return correlation;

        return null;
    }
}

Here is a simplified example:
The method to be executed

private async Task<string> Test() 
{ 
    await Task.Delay(1000); 
    throw new CustomException(""); 
}

The policy

var retryPolicy = Policy<string>
    .Handle<CustomException>()
    .WaitAndRetryAsync(5, _ => TimeSpan.FromSeconds(1),
        (result, delay, retryCount, context) =>
        {
            var id = context.GetCorrelation();
            Console.WriteLine($"{id} - #{retryCount} retry.");
        });

The usage

var context = new Context().SetCorrelation();
try
{
    await retryPolicy.ExecuteAsync(async (ctx) => await Test(), context);
}
catch (CustomException)
{
    Console.WriteLine($"{context.GetCorrelation()} - All retry has been failed.");
}

The sample output

3319cf18-5e31-40e0-8faf-1fba0517f80d - #1 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #2 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #3 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #4 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - #5 retry.
3319cf18-5e31-40e0-8faf-1fba0517f80d - All retries have failed.

Fallback

As mentioned above, whenever a policy can't succeed it re-throws the handled exception. In other words, if a policy fails, it escalates the problem to the next level (the next outer policy).

Here is a simplified example:
The policy

var fallbackPolicy = Policy<string>
    .Handle<CustomException>()
    .FallbackAsync(async (result, ctx, ct) =>
    {
        await Task.FromException<CustomException>(result.Exception);
        return result.Result; // never executed, only needed so the code compiles
    }, 
    (result, ctx) =>
    {
        Console.WriteLine($"{ctx.GetCorrelation()} - All retry has been failed.");
        return Task.CompletedTask;
    });

The usage

var context = new Context().SetCorrelation();
try
{
    var strategy = Policy.WrapAsync(fallbackPolicy, retryPolicy);  
    await strategy.ExecuteAsync(async (ctx) => await Test(), context);
}
catch (CustomException)
{
    Console.WriteLine($"{context.GetCorrelation()} - All policies failed.");
}

The sample output

169a270e-acf7-45fd-8036-9bd1c034c5d6 - #1 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #2 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #3 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #4 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - #5 retry.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - All retries have failed.
169a270e-acf7-45fd-8036-9bd1c034c5d6 - All policies failed.
  • Thanks for the detailed answer. I already have exactly this mechanism in place. It's what happens in the `LogRetry()` method that I also call in the `catch` block. I use the request body as a unique identifier. That's how I determined in the first place that the uncaught exceptions happen _before_ the retries are exhausted. In some log files I could see that an exception was uncaught before even a single retry with Polly was logged. – Matthias Schuchardt Dec 16 '20 at 10:36
  • @MatthiasSchuchardt Hmm, that's strange. Then the only thing I can think of is the `using` around your HttpClient. Have you tried without that? Also, could you please check what the `InnerException` is in case of the top-level caught `HttpRequestException`? – Peter Csala Dec 16 '20 at 13:37
  • The inner exception is an `IOException: The response ended prematurely.` with no further level of nesting. Removing the `using`s does not have any effect. I still occasionally run into the `catch`-block. – Matthias Schuchardt Dec 17 '20 at 16:15
  • That's weird. `The response ended prematurely` shouldn't be the last exception; usually it has an inner exception. Do you log the whole exception hierarchy or just the top exception and its inner one? – Peter Csala Dec 17 '20 at 16:41
  • I log the complete exception hierarchy. Also when I debug the app and put a breakpoint in the `catch`-block, I can see that the `IOException`'s `InnerException`-property is `null`. – Matthias Schuchardt Dec 18 '20 at 09:54
  • I'm out of ideas. Maybe you can check these GitHub issues: [1](https://github.com/dotnet/runtime/issues/31528), [2](https://github.com/dotnet/runtime/issues/30038), [3](https://github.com/dotnet/runtime/issues/31132) – Peter Csala Dec 18 '20 at 09:58
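
Since the last few comments revolve around logging the full exception hierarchy, here is a tiny hedged sketch (the helper name is made up, not from the original code) of walking the InnerException chain to confirm what actually arrives in the catch block:

static void LogExceptionChain(Exception ex)
{
    // Walk from the outermost exception down to the innermost one.
    for (var current = ex; current != null; current = current.InnerException)
    {
        Console.WriteLine($"{current.GetType().Name}: {current.Message}");
    }
}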