I have a bunch of time-sensitive functions that I schedule to run asynchronously (using an in-house async job scheduler). I am trying to use my observability tool (Datadog) to get alerted if the run time of any of these functions exceeds a specific SLA (15 minutes, say). A couple of key constraints:
- I need to get alerted as soon as the SLA is breached, even if the function is actively running.
- I don't want to alter the state of the function by killing/retrying if it doesn't meet the SLA.
I am thinking of using a log-based monitor for this because I log the following:
"<function_name> - <UUID> - process scheduled"
"<function_name> - <UUID> - process started"
"<function_name> - <UUID> - process finished"
However, I am noticing that Datadog (like the other observability tools I've researched online) does not seem to be able to create a monitor like "Alert if '<function_name> - <UUID> - process finished' does not show up within 15 minutes of '<function_name> - <UUID> - process started'".
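To make the check I'm after concrete, here is a minimal sketch in Python. It is only an illustration of the logic, not something Datadog provides: the function name `find_sla_breaches` and the idea of feeding it (timestamp, message) pairs are my own assumptions for the example.

```python
from datetime import datetime, timedelta

SLA = timedelta(minutes=15)

def find_sla_breaches(events, now, sla=SLA):
    """Return (function_name, uuid) pairs that started more than `sla` ago
    and have not logged 'process finished' yet.

    `events` is an iterable of (timestamp, message) tuples. Pairing each
    log line with a timestamp is an assumption of this sketch.
    """
    started = {}
    finished = set()
    for ts, message in events:
        # Split from the right so the UUID and state are always the last two parts.
        function_name, run_id, state = message.rsplit(" - ", 2)
        key = (function_name, run_id)
        if state == "process started":
            started[key] = ts
        elif state == "process finished":
            finished.add(key)
    # A run breaches the SLA if it is still unfinished after `sla` has elapsed.
    return sorted(
        key for key, ts in started.items()
        if key not in finished and now - ts > sla
    )
```

The important part is that a run counts as a breach while it is still in progress, so the check has to be driven by the absence of the "finished" line rather than by its arrival.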
One potential solution I thought of is to create a metric for each execution of the function, like <function_name>.<uuid>.state, and increment it on every state change so that the value of state would be 1 for scheduled, 2 for started, and 3 for finished. I would then create an alert if the value of state stayed at 1 or 2 for more than 15 minutes, but that does not scale well, since every execution creates a new metric.
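A rough Python sketch of the per-execution metric idea shows where the scaling problem comes from. The `metrics` dict is just an in-memory stand-in for a metrics backend, and `record_state` is a hypothetical helper, not a Datadog API.

```python
import uuid

STATE_VALUES = {"scheduled": 1, "started": 2, "finished": 3}

# Stand-in for the metrics backend: metric name -> latest gauge value.
metrics = {}

def record_state(function_name, run_id, state):
    """Set the state gauge for one execution and return the metric name."""
    name = f"{function_name}.{run_id}.state"
    metrics[name] = STATE_VALUES[state]
    return name

# Simulate 1,000 executions of a single function.
for _ in range(1000):
    run_id = uuid.uuid4()
    for state in ("scheduled", "started", "finished"):
        record_state("my_function", run_id, state)

# One distinct metric name per execution, so the count grows without bound.
print(len(metrics))
```

Because the UUID is part of the metric name, the number of distinct metrics grows with the number of executions rather than with the number of functions.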
I could also potentially solve this with a custom log analytics query using Spark or some other big data tool, but I'm looking for something much more 'low-tech', as it feels like this is something an observability tool should be able to provide (I could be wrong).
Any ideas would be appreciated.