
We set up an Event Grid topic and a logic app as a subscriber to the event grid. One of our core requirements is not to lose requests once the event grid has received them. Not losing requests means this system must always come back to the caller with either success or failure.

So here is what we have:

  1. Time-to-live in Event Grid: when the logic app is dead (does not accept the requests from Event Grid), the requests fall out to the "Time-to-live" dead-letter queue and we notify the caller "fail" (roughly the subscription settings sketched after this list)
  2. Logic App timeout: when some part of the Logic App fails or loops, the timeout occurs and we notify the caller "fail"
  3. The Logic App runs smoothly, then "success" (oversimplifying)
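
For reference, the Event Grid subscription is wired up roughly like the sketch below. This is illustrative only, not our exact setup: the resource names are placeholders and the SDK surface may differ between azure-mgmt-eventgrid versions.

```python
# Rough sketch of the Event Grid subscription: retries with a time-to-live,
# then dead-lettering to a storage container. All names/ids are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventgrid import EventGridManagementClient
from azure.mgmt.eventgrid.models import (
    EventSubscription,
    RetryPolicy,
    StorageBlobDeadLetterDestination,
    WebHookEventSubscriptionDestination,
)

client = EventGridManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.event_subscriptions.begin_create_or_update(
    scope="<event-grid-topic-resource-id>",
    event_subscription_name="logic-app-subscription",
    event_subscription_info=EventSubscription(
        destination=WebHookEventSubscriptionDestination(
            endpoint_url="<logic-app-http-trigger-url>",
        ),
        # Give up after 1 hour / 10 attempts, then dead-letter the event.
        retry_policy=RetryPolicy(
            max_delivery_attempts=10,
            event_time_to_live_in_minutes=60,
        ),
        dead_letter_destination=StorageBlobDeadLetterDestination(
            resource_id="<storage-account-resource-id>",
            blob_container_name="deadletter",
        ),
    ),
).result()
```

A watcher on that dead-letter container is what sends the "fail" back to the caller in case 1.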

Now the question is: what if the logic app crashes entirely? If the logic app crashes, its timeout (2 above) will not be functional, so we would never be able to return anything to the caller.

Are there solutions at the infrastructure level that avoid building a complex mechanism ourselves?

For example, Logic App disaster recovery: set up two instances in different regions?

Or should we do something like

  1. Should we create another timer that is completely separate from the logic app, so the additional timer won't go down together with the logic app?
  2. Should we save the request statuses as the logic app progresses, then create another function app that watches those statuses, so that when the logic app comes back up, the function app picks the requests up based on their statuses and pushes them back to the logic app? (See the sketch after this list.)
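
To make option 2 concrete, a minimal sketch of such a watchdog is below. The table name, its columns, and the Logic App trigger URL are all hypothetical; the real schema would be whatever the logic app writes as it progresses.

```python
# Sketch of a watchdog that re-submits stalled requests to the Logic App.
# Assumes request statuses are kept in an Azure Table ("RequestStatus") and
# that the Logic App exposes an HTTP trigger URL; both are hypothetical names.
from datetime import datetime, timedelta, timezone

import requests
from azure.data.tables import TableClient

TABLE_CONN_STR = "<storage-connection-string>"
LOGIC_APP_TRIGGER_URL = "<logic-app-http-trigger-url>"
STALE_AFTER = timedelta(minutes=15)  # consider a request stuck after this long


def resubmit_stalled_requests() -> None:
    table = TableClient.from_connection_string(TABLE_CONN_STR, table_name="RequestStatus")
    cutoff = datetime.now(timezone.utc) - STALE_AFTER

    # Entities the Logic App marked as "InProgress" but never finished.
    for entity in table.query_entities("Status eq 'InProgress'"):
        last_update = entity["LastUpdatedUtc"]  # written by the Logic App as it progresses
        if last_update < cutoff:
            # Push the original request back into the Logic App.
            requests.post(
                LOGIC_APP_TRIGGER_URL,
                json={"requestId": entity["RowKey"], "payload": entity["OriginalPayload"]},
            )
            entity["Status"] = "Resubmitted"
            table.update_entity(entity)
```

This could run on a timer trigger in a Function App deployed separately from the logic app, which would also cover option 1.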

Thank you kindly

I have looked at the Microsoft Logic Apps technical documentation.

Toby Yoo

1 Answer


Interesting problem you have got there. I'm just going to throw some thoughts into the void to shape the discussion :)

If you are worried about your LA crashing and not being able to notify your event source about the crash, there are really only two alternatives: redundancy and/or event durability.


Redundancy: by fanning out your solution and letting multiple workers handle the events sent by Event Grid, you increase the odds of the message actually making it back to the event source. This solution, however, requires that the receiver of the "success / fail" message can handle duplicates. In my opinion it is a lazy solution that will work in the short run, but it does not really solve the problem.
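
For example, the receiver of those acknowledgements would need some form of idempotency check on the event id, along the lines of the minimal sketch below (the in-memory set and the callback are stand-ins for whatever store and notification mechanism you actually use):

```python
# Minimal sketch of duplicate handling on the receiver side when several
# workers may report the same outcome. The event id is whatever unique id
# your events carry; the store here is an in-memory set for illustration only.
seen_event_ids: set[str] = set()


def handle_outcome(event_id: str, outcome: str) -> None:
    if event_id in seen_event_ids:
        return  # duplicate report from another worker, ignore it
    seen_event_ids.add(event_id)
    notify_caller(event_id, outcome)


def notify_caller(event_id: str, outcome: str) -> None:
    # Placeholder for the real notification back to the event source.
    print(f"event {event_id}: {outcome}")
```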


Event durability: what I think you should do is involve a Service Bus in one way or another, to get the benefit of the dead letter queue / max delivery count. The first problem with this is: how do we get the event into the Service Bus topic/queue? Well, we could use a few function apps (more than one, in different regions, to avoid a single point of failure) to ingest the data into the service bus, see image:

[Diagram: function apps in several regions ingesting the Event Grid event into a Service Bus queue, which in turn triggers the Logic App]

In the normal case the function apps will ingest multiple copies of the event into the service bus; that's where duplicate detection comes in and saves us. We can now trigger the Logic App normally and let it handle the event. Once the event is marked as succeeded, the message is removed from the service bus and a success message is sent to the event source.
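
The ingestion side could look roughly like this (connection string and queue name are placeholders; the queue is assumed to have duplicate detection enabled). The key point is that Service Bus deduplicates on MessageId, so every function app sets it to the Event Grid event id:

```python
# Sketch: forward an Event Grid event into a Service Bus queue that has
# duplicate detection enabled. Several function apps may run this for the same
# event; because they all set message_id to the event id, only one copy lands.
import json

from azure.servicebus import ServiceBusClient, ServiceBusMessage

SERVICE_BUS_CONN_STR = "<service-bus-connection-string>"
QUEUE_NAME = "incoming-events"  # placeholder queue name


def forward_event(event: dict) -> None:
    with ServiceBusClient.from_connection_string(SERVICE_BUS_CONN_STR) as client:
        with client.get_queue_sender(QUEUE_NAME) as sender:
            sender.send_messages(
                ServiceBusMessage(
                    json.dumps(event),
                    message_id=event["id"],  # duplicate detection keys on this
                )
            )
```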

In case of Logic App failure: if the logic app crashes and the message cannot be completed, the message is either run again once the logic app comes back online, or it is put on the dead letter queue once the max delivery count is exceeded. If the message is put on the dead letter queue, you could have another Logic App / Function App that triggers on it and sends the fail message to the event source (from another region, to limit the chance of that resource being down as well).
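
A dead-letter watcher along those lines might look like the sketch below (the callback URL is a placeholder for however your event source wants to be notified):

```python
# Sketch of a dead-letter watcher: reads messages that exhausted their
# delivery attempts and reports "fail" back to the event source.
import requests
from azure.servicebus import ServiceBusClient, ServiceBusSubQueue

SERVICE_BUS_CONN_STR = "<service-bus-connection-string>"
QUEUE_NAME = "incoming-events"  # same queue the Logic App consumes
CALLER_CALLBACK_URL = "<event-source-callback-url>"  # hypothetical


def report_dead_lettered_events() -> None:
    with ServiceBusClient.from_connection_string(SERVICE_BUS_CONN_STR) as client:
        receiver = client.get_queue_receiver(
            QUEUE_NAME, sub_queue=ServiceBusSubQueue.DEAD_LETTER
        )
        with receiver:
            for message in receiver.receive_messages(max_message_count=10, max_wait_time=5):
                requests.post(
                    CALLER_CALLBACK_URL,
                    json={"eventId": message.message_id, "status": "fail"},
                )
                receiver.complete_message(message)  # remove it from the DLQ once reported
```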

This solution might be a bit too much, but I'm just throwing thoughts out there to trigger your imagination.

Eric Qvarnström
  • Wow, thank you so much for taking the time to answer. Really appreciate that. The second option would work, but I was wondering whether Azure Logic Apps might have a processing state we could save into a database, much like NServiceBus does to enhance message durability? Or maybe use geo-redundancy? Or maybe use a load balancer? At this stage, I would like to see a way of achieving it that avoids any dramatic changes to the application architecture and relies on infrastructure changes instead, if possible :) – Toby Yoo Jul 24 '23 at 21:17
  • @TobyYoo I guess that fetching the process state of the LA could be solved by adding extensive diagnostic settings to the LA and letting that fill a Log Analytics workspace with runtime metrics & logs for you (?). You could then create alert rules in Azure Monitor to notify the event sender about failed runs. Regarding the geo-redundancy, I cannot see how that would protect you against sudden crashes of the LA, unless you fan out as in my first example, but that solution requires the receiver of the ACKs to manage duplicate messages :) – Eric Qvarnström Jul 25 '23 at 06:06
  • @TobyYoo I guess that your event generator is not able to publish messages directly to a Service Bus, otherwise that would without doubt be the best option for this kind of problem :) – Eric Qvarnström Jul 25 '23 at 06:07
  • Thank you so much for taking the time to share your idea. We decided to take a wait-and-see approach for now; we don't want to spend too much effort on something that may never happen. If it happens in the future, depending on its criticality to the business, we will create a small separate timer around the logic app, etc. If we had to go for a full-blown solution, I would consider your suggested approach. So I will accept your answer. Thank you again. – Toby Yoo Aug 19 '23 at 22:01