I couldn't find anything solid in the documentation. So I ran a few tests myself as I was intrigued by the question. I've also included a potential "solution" to your multistep issue, or rather, that it is not idempotent (yet?). Scroll through the wall of text, it's at the end.
First thoughts are that two instances could run the same event, if only and if the first invocation exceeds the default 30 seconds for response, which is why the "at-least-once" delivery is mentioned. Example:
- Event is published
- Instance A picks up the event
- Instance A runs for more than 30 seconds
- Event is added to retry queue as instance A hasn't replied yet
- Instance A is finished, 35 seconds total elapsed
- Event from retry queue is published after sitting in retry queue for 10 seconds
- Instance B picks up the event
I really think it comes down to the execution duration of the two instances how EventGrid reacts.
According to the documentation for Retry schedule:
If the endpoint responds within 3 minutes, Event Grid will attempt to
remove the event from the retry queue on a best effort basis but
duplicates may still be received.
This could be interpreted as such that as soon as an instance replies with a successful response EventGrid will attempt to clean out the retry queue (from the example above, instance B should never pick up the event). However, I'm also interested in what happens when instance A fails.
I tried with the following code, which will throw an exception from instance A after a certain delay:
public static class Function1
{
private static Dictionary<string, int> _Counter = new();
[FunctionName("Function1")]
public static async Task Run([EventGridTrigger] EventGridEvent eventGridEvent, ILogger log)
{
if (_Counter.ContainsKey(eventGridEvent.Id))
{
_Counter[eventGridEvent.Id]++;
log.LogInformation($"{eventGridEvent.Id}: {_Counter[eventGridEvent.Id]}");
return;
}
_Counter.Add(eventGridEvent.Id, 1);
var delay = (long)eventGridEvent.Data;
log.LogInformation($"Delay: {delay}");
await Task.Delay(TimeSpan.FromSeconds(delay));
log.LogInformation($"{eventGridEvent.Id}: {_Counter[eventGridEvent.Id]}");
throw new Exception($"{eventGridEvent.Id} Throwing after {delay} seconds delay");
}
}
Results from different delays
Delay = 35
2022-02-06T00:34:10.798 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:34:10.7899831+00:00', Id=9d1adfe9-21bd-454c-862e-a4d8d5b312e0)
2022-02-06T00:34:10.801 [Information] Delay: 35
2022-02-06T00:34:45.807 [Information] 6d3d141d-d0ee-4e81-a648-e2497e562b84: 1
2022-02-06T00:34:45.894 [Error] Executed 'Function1' (Failed, Id=9d1adfe9-21bd-454c-862e-a4d8d5b312e0, Duration=35096ms)6d3d141d-d0ee-4e81-a648-e2497e562b84 Throwing after 35 seconds delay
2022-02-06T00:34:55.949 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:34:55.9486015+00:00', Id=58a7a6db-2f97-4bc7-b878-030b8be25711)
2022-02-06T00:34:55.949 [Information] 6d3d141d-d0ee-4e81-a648-e2497e562b84: 2
2022-02-06T00:34:55.949 [Information] Executed 'Function1' (Succeeded, Id=58a7a6db-2f97-4bc7-b878-030b8be25711, Duration=1ms)
Seems that the retry that was most likely added to the queue after 30 seconds was either removed/replaced or its scheduled execution was "updated":
- 58-13 = 45,
- 35 for execution duration of instance A
- 10 for retry delay after instance A failed
Delay = 45
2022-02-06T00:37:29.526 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:37:29.5262414+00:00', Id=c4596efa-7034-4efe-bcf9-85d502517711)
2022-02-06T00:37:29.526 [Information] Delay: 45
2022-02-06T00:38:14.524 [Information] 284fa140-1cdc-41e7-973f-d0ecd6f19692: 1
2022-02-06T00:38:14.530 [Error] Executed 'Function1' (Failed, Id=c4596efa-7034-4efe-bcf9-85d502517711, Duration=44999ms)284fa140-1cdc-41e7-973f-d0ecd6f19692 Throwing after 45 seconds delay
2022-02-06T00:38:24.590 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:38:24.5901035+00:00', Id=43a1fafa-7773-44e5-9f3a-b4bf697865a9)
2022-02-06T00:38:24.590 [Information] 284fa140-1cdc-41e7-973f-d0ecd6f19692: 2
2022-02-06T00:38:24.590 [Information] Executed 'Function1' (Succeeded, Id=43a1fafa-7773-44e5-9f3a-b4bf697865a9, Duration=0ms)
This is sorta unexpected behavior. We'd have expected to see a 2nd invocation finish before the 1st one, presumably finishing after 40 seconds after the 1st invocation's start. This could indicate that EventGrid actually knows that the 1st instance hasn't failed, and is still running, so it's actually not adding the retry attempt to the queue just yet. It does however add it after the 1st instance fails, and the retry attempt is, as expected, executed 10 seconds later.
I wanted to test the above, and thought the magic number could be 3 minutes (also mentioned in the documentation) and not 30 seconds.
Delay = 185
2022-02-06T00:40:20.381 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:40:20.3809687+00:00', Id=454b815f-6ca6-4b75-9272-711c7b6bf42a)
2022-02-06T00:40:20.381 [Information] Delay: 185
2022-02-06T00:43:00.393 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:43:00.3928958+00:00', Id=23910142-d903-4b9b-bec2-a5665e8e1db7)
2022-02-06T00:43:00.393 [Information] 97d828e6-d1f3-4523-b043-1f76465de6ea: 2
2022-02-06T00:43:00.393 [Information] Executed 'Function1' (Succeeded, Id=23910142-d903-4b9b-bec2-a5665e8e1db7, Duration=0ms)
2022-02-06T00:43:25.381 [Information] 97d828e6-d1f3-4523-b043-1f76465de6ea: 2
2022-02-06T00:43:25.387 [Error] Executed 'Function1' (Failed, Id=454b815f-6ca6-4b75-9272-711c7b6bf42a, Duration=185002ms)97d828e6-d1f3-4523-b043-1f76465de6ea Throwing after 185 seconds delay
This is also sorta contradicting the documentation. The first instance executes at 40:20. At 43:00 (would've expected 43:20), it runs the 2nd instance, completes immediately. Then the 1st instance fails at 43:25 (as expected). Maybe it is 3 minutes and not 30 seconds. But EventGrid only looks at the minute-part of the timestamp. One thing we can take away from this is that the 1st instance's delayed failure doesn't trigger a 3rd invocation.
Let's flip when the exception is thrown; the 2nd invocation will throw an exception, like this:
public static class Function1
{
private static Dictionary<string, int> _Counter = new();
[FunctionName("Function1")]
public static async Task Run([EventGridTrigger] EventGridEvent eventGridEvent, ILogger log)
{
if (_Counter.ContainsKey(eventGridEvent.Id))
{
_Counter[eventGridEvent.Id]++;
throw new Exception($"{eventGridEvent.Id}: {_Counter[eventGridEvent.Id]}");
}
_Counter.Add(eventGridEvent.Id, 1);
var delay = (long)eventGridEvent.Data;
log.LogInformation($"Delay: {delay}");
await Task.Delay(TimeSpan.FromSeconds(delay));
log.LogInformation($"{eventGridEvent.Id} finished after {delay}: {_Counter[eventGridEvent.Id]}");
return;
}
}
Delay=35
2022-02-06T00:57:00.777 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:57:00.7684489+00:00', Id=16505ac4-0cb2-4b1a-8b7c-d4adf964fe14)
2022-02-06T00:57:00.780 [Information] Delay: 35
2022-02-06T00:57:35.794 [Information] 7fce3bc0-f5d3-4b3e-bd5d-4997299aa9b7 finished after 35: 1
2022-02-06T00:57:35.796 [Information] Executed 'Function1' (Succeeded, Id=16505ac4-0cb2-4b1a-8b7c-d4adf964fe14, Duration=35028ms)
Still nothing after 30 seconds. I'm skipping 45, as I'm pretty sure it'll yield same result as 35.
Delay=185
2022-02-06T00:59:04.226 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T00:59:04.2262900+00:00', Id=cac343e5-911b-4ba7-a0c7-69e883a61392)
2022-02-06T00:59:04.227 [Information] Delay: 185
2022-02-06T01:01:44.196 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T01:01:44.1962398+00:00', Id=2c65d66b-57c8-4a7d-b9ec-d54959e84e3d)
2022-02-06T01:01:44.271 [Error] Executed 'Function1' (Failed, Id=2c65d66b-57c8-4a7d-b9ec-d54959e84e3d, Duration=68ms)e051565c-9900-4cef-9786-5795c9fc7f91: 2
2022-02-06T01:02:09.240 [Information] e051565c-9900-4cef-9786-5795c9fc7f91 finished after 185: 2
2022-02-06T01:02:09.240 [Information] Executed 'Function1' (Succeeded, Id=cac343e5-911b-4ba7-a0c7-69e883a61392, Duration=185014ms)
2022-02-06T01:02:14.293 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T01:02:14.2930303+00:00', Id=729e8079-e33e-48f4-8c5a-48e102a6d9d1)
2022-02-06T01:02:14.300 [Error] Executed 'Function1' (Failed, Id=729e8079-e33e-48f4-8c5a-48e102a6d9d1, Duration=1ms)e051565c-9900-4cef-9786-5795c9fc7f91: 3
2022-02-06T01:03:14.395 [Information] Executing 'Function1' (Reason='EventGrid trigger fired at 2022-02-06T01:03:14.3956880+00:00', Id=5996df91-cecf-418f-a5b5-82d7680761f8)
2022-02-06T01:03:14.403 [Error] Executed 'Function1' (Failed, Id=5996df91-cecf-418f-a5b5-82d7680761f8, Duration=1ms)e051565c-9900-4cef-9786-5795c9fc7f91: 4
It kept failing after this 'till the retry policy's delay became longer than the Function instance's idle cool down (the static dictionary was obviously removed from memory as the instance was taken down).
This could indicate that EventGrid ignores any response from an invocation older than 3 minutes. That is, the 1st invocation's 185 second delayed "OK" response was ignored. All subsequent invocations failed, forcing the event to be queued for retry.
There might be some more to it than what these simple tests show. If you really must now I suggest opening an issues ticket in GitHub
Potential solution for your idempotency issue
I'm not sure I fully understand your "multipart step". But if you can wrap that part into a single session, you could chain your EventGridTrigger
with a session-enabled ServiceBusTrigger
.
If your events have unique identifiers (eventGridEvent.Id
), then you could build a service bus message with a session id matching the event's identifier. The service bus trigger, with sessions enabled, will make sure that the event (or session, rather) is only invoked 1 at a time.
This works because the two identical events also share the same eventGridEvent.Id
. You will however have to keep track of these identifiers. At the end of your service bus session you'd have to mark the ID as "processed". At the beginning of your session, check if that ID was previously processed; ignore the request if it was.
Read more about ServiceBusTrigger
with "Message sessions" here and here (look at the isSessionsEnabled
property) (you can search for "session" to find more on the page related to session, it's scattered around).