I have a bunch of time-sensitive functions that I schedule to run asynchronously (using an in-house async job scheduler). I am trying to use my observability tool (Datadog) to get alerted if the run times of these functions exceed specific SLAs (15 minutes, say). A couple of key factors:

  1. I need to get alerted as soon as the SLA is breached, even if the function is actively running.
  2. I don't want to alter the state of the function by killing/retrying if it doesn't meet the SLA.

I am thinking of using a log-based monitor for this because I log the following:

"<function_name> - <UUID> - process scheduled"
"<function_name> - <UUID> - process started"
"<function_name> - <UUID> - process finished"

However, I am noticing that Datadog (and the other observability tools that I've researched online) does not seem to be able to create a monitor like "Alert if <function_name> - <UUID> - process finished does not show up 15 minutes after <function_name> - <UUID> - process started".
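
To make the desired check concrete, here is a minimal Python sketch of the correlation such a monitor would need to perform. It is not a Datadog feature or API. It assumes each log record arrives as a (timestamp, message) pair, since the log format above shows no timestamp, and the names `LOG_PATTERN` and `find_sla_breaches` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches the three log messages shown above. Assumption: the UUID contains
# no whitespace, so " - " can be used as the field separator.
LOG_PATTERN = re.compile(
    r"^(?P<function>.+?) - (?P<uuid>\S+) - process "
    r"(?P<state>scheduled|started|finished)$"
)


def find_sla_breaches(records, now, sla=timedelta(minutes=15)):
    """Return sorted (function, uuid) pairs that started more than `sla`
    ago and have no matching "process finished" record.

    `records` is an iterable of (timestamp, message) pairs (assumed shape).
    """
    started = {}      # (function, uuid) -> time the "started" line was logged
    finished = set()  # (function, uuid) pairs that logged "finished"

    for timestamp, message in records:
        match = LOG_PATTERN.match(message)
        if match is None:
            continue  # ignore unrelated log lines
        key = (match["function"], match["uuid"])
        if match["state"] == "started":
            started[key] = timestamp
        elif match["state"] == "finished":
            finished.add(key)

    return sorted(
        key
        for key, started_at in started.items()
        if key not in finished and now - started_at > sla
    )
```

Because the check only reads logs, running it periodically (every minute, say) would flag an execution while it is still running, without killing or retrying it.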

One potential solution I thought of is to create a metric for each execution of the function, like <function_name>.<uuid>.state, and update it on every state change so that its value is 1 for scheduled, 2 for started, and 3 for finished. I would then alert if the value stayed at 1 or 2 for more than 15 minutes, but that does not scale well, since every execution creates a new metric name.

I could also potentially solve this with a custom log analytics query using Spark or some other big data tool, but I'm looking for something much more 'low-tech', as it feels like this is something an observability tool should be able to provide (I could be wrong).

Any ideas would be appreciated.

shridharama

1 Answer

You could do this using SQL-based alerting on logs in OpenObserve: https://github.com/openobserve/openobserve
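
As a rough illustration of the kind of query a SQL-based alert could run, the sketch below uses generic SQL against an in-memory SQLite table from Python. It does not use OpenObserve's actual schema or alert syntax. The table `logs`, its columns (`function_name`, `uuid`, `state`, `ts` in epoch seconds), and the helper names are all assumptions made for the example.

```python
import sqlite3

# Anti-join: "started" rows with no matching "finished" row for the same
# UUID, whose start time is more than the SLA in the past.
OVERDUE_QUERY = """
SELECT s.function_name, s.uuid
FROM logs AS s
LEFT JOIN logs AS f
  ON f.uuid = s.uuid AND f.state = 'finished'
WHERE s.state = 'started'
  AND f.uuid IS NULL
  AND s.ts < :now - :sla_seconds
ORDER BY s.function_name, s.uuid
"""


def create_logs_table(conn):
    """Create the hypothetical parsed-log table used by this sketch."""
    conn.execute(
        "CREATE TABLE logs ("
        "function_name TEXT, uuid TEXT, state TEXT, ts INTEGER)"
    )


def overdue_executions(conn, now, sla_seconds=15 * 60):
    """Return (function_name, uuid) rows that breached the SLA at `now`."""
    params = {"now": now, "sla_seconds": sla_seconds}
    return conn.execute(OVERDUE_QUERY, params).fetchall()
```

An alert would fire whenever the query returns at least one row. The query never touches the job itself, so the running function is left alone.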

Prabhat