I'm running bosun to alert against an elasticsearch data set.

The scenario is that there are a number of cron jobs that do various things. If these execute successfully, they log a success message. If they die or fail to run for whatever reason and never log the success message, we need to know about it.

My question is how to get a 0 result if no record is found, rather than null. Here's the basic query:

nv(sum(escount(esls("logs"), "context.taskname", esand(esgte("context.elapsed_time", 0), esor(esquery("context.taskname", "Task1 or Task2 or Task3 or Task4"))), "360m", "360m", "")), 0)

If a given task has run in the interval specified, the query should return a non-zero value for the number of success messages the task has logged.

This works, but I want the alert to fire ONLY if the task hasn't run. The problem is that if Task1 hasn't run and logged a completion message, it's just dropped from the final grouping rather than returning a 0 count.

Is there a way to ensure that each task in the esor returns something, even if it's a zero value?

user101289

2 Answers

1

In your situation there are 3 aspects to monitor:

  1. Have the jobs run
  2. Did the jobs run with a successful result
  3. Did the jobs run with an unsuccessful result

Elastic doesn't matter in this case, so I have simulated the responses with the series function:

alert zero_example {
    # success log messages
    $successful = sum(merge(series("job=task1", 0, 1), series("job=task2", 0, 1)))
    # error log messages
    $error = sum(merge(series("job=task1", 0, 0), series("job=task3", 0, 1)))

    # warn if no successful message or there is a non-zero number of error messages.
    # nv makes it so if there are no error messages, it will be treated as zero
    warn = nv($successful == 0, 0) || nv($error != 0, 0)

    # the final case is that a job hasn't logged. As long as the alert saw it in the 
    # first place, then Bosun will treat it as "unknown" when the result set disappears
    # from the result
}
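One way to picture what nv does in the warn line is as a join over group names with a default for the missing side. The following is a plain-Python sketch of that idea, not Bosun code; the helper name join_with_default is hypothetical, and the join semantics are an assumption based on how nv is described in the Bosun documentation.

```python
# Plain-Python illustration, not Bosun code. Group names mirror the
# simulated series in the alert above.

def join_with_default(left, right, default, op):
    """Combine two per-group results, substituting `default` for a group
    that is missing from one side (the role nv plays in the alert).
    Hypothetical helper, assumed semantics."""
    groups = sorted(set(left) | set(right))
    return {g: op(left.get(g, default), right.get(g, default)) for g in groups}

# Per-job sums, matching the simulated series in the alert.
successful = {"task1": 1, "task2": 1}
error = {"task1": 0, "task3": 1}

# $successful == 0 and $error != 0, evaluated per group (1 = true, 0 = false).
no_success = {job: int(count == 0) for job, count in successful.items()}
has_error = {job: int(count != 0) for job, count in error.items()}

# The || in the warn expression, applied per group after the defaulted join.
warn = join_with_default(
    no_success, has_error, 0, lambda a, b: int(bool(a) or bool(b))
)
print(warn)  # {'task1': 0, 'task2': 0, 'task3': 1}
```

Note that a job appearing in neither set (say, a task4 that never logged anything) is absent from the result entirely, which is the gap the question is about.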
Kyle Brandt
  • that sounds like a possible answer-- would you mind showing me the context of how it would be used in the example query above? – user101289 Mar 22 '17 at 17:47
  • @user101289 That exists outside of the expression language. Since it is an alert keyword, it would go in your alert definition as `unknownIsNormal = true`. You then can drop the nv. – Kyle Brandt Mar 22 '17 at 17:50
  • it didn't seem to make a difference. If I change `Task1` to something non-existent (eg. `Tasky1`) it doesn't show up in the returned list grouped by taskname – user101289 Mar 22 '17 at 17:57
  • @user101289 That is correct. There is no simple way to do that since Bosun will only know about the `context.taskname` values from the returned query. Can you explain your use case? (Also, FYI periods are not allowed in field names in elastic version 3 and later technically, so you may want to address that). – Kyle Brandt Mar 22 '17 at 18:00
  • Okay I think I understand better, need a few minutes to show an example. – Kyle Brandt Mar 22 '17 at 18:07
0

You cannot generate a series from a query that doesn't return any results. Usually if you want an alert for "X didn't happen in the last T timeframe" you need to use a larger window. So if your timeframe is 24 hours, you need to use a larger window of 72 hours and use (since(...) / 3600) > 24 to trigger the alert when the last positive result is older than 24 hours ago.

This alert would only remain active for 2 days, after which the oldest positive result would fall outside the sliding window. If it is something that could break on a weekend and not be addressed for a few days, you may want to use 5 or 7 days for the query instead of just 3.
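The window arithmetic can be checked with a small Python sketch (an illustration of the reasoning, not Bosun code). The function names and the "no result" label are hypothetical, and the exact behavior at the window boundary is an assumption.

```python
def alert_state(hours_since_last_success, threshold_hours=24, window_hours=72):
    """Classify a task by the age of its most recent success message."""
    if hours_since_last_success >= window_hours:
        # The last success has left the query window, so the query returns
        # nothing for this task and the threshold can no longer be evaluated.
        return "no result"
    if hours_since_last_success > threshold_hours:
        return "alert"
    return "ok"


def alert_duration_hours(threshold_hours, window_hours):
    """How long the alert can stay active before the data ages out."""
    return max(window_hours - threshold_hours, 0)


print(alert_state(12))                # ok
print(alert_state(30))                # alert
print(alert_state(80))                # no result
print(alert_duration_hours(24, 72))   # 48 hours, i.e. 2 days
print(alert_duration_hours(24, 168))  # 144 hours with a 7-day window
```

Widening the window only lengthens the period in which the alert can fire; it does not remove the need for at least one positive result inside the window.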

In your case assuming you want to see events every 6 hours this would probably be something like:

$q = escount(esls("logs"), "context.taskname", esand(esgte("context.elapsed_time", 0), esor(esquery("context.taskname", "Task1 or Task2 or Task3 or Task4"))), "1h", "72h", "")
$hoursSince = since($q) / 3600
warn = $hoursSince > 6

Keep in mind that there MUST still be a positive result in the time window for a negative (or absent) result to trigger the alert. A much better way is to get your system to generate data for both positive and negative results so you can alert on them. Or keep a counter of "work done" (emails, bytes, whatever) that is always increasing so you can see when the task stalls.
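As a rough illustration of the counter idea, here is a Python sketch (not Bosun code) that flags a stall when an always-increasing counter has not advanced for too long. The function name is_stalled, the sample layout, and the choice to treat an empty series as stalled are all assumptions made for the example.

```python
def is_stalled(samples, now, max_idle_seconds):
    """Return True if the counter has not increased within max_idle_seconds.

    samples: list of (unix_timestamp, counter_value) pairs sorted by time.
    An empty list is treated as stalled (an assumption of this sketch).
    """
    if not samples:
        return True
    # Start from the first sample, then track the last time the value rose.
    last_change_ts = samples[0][0]
    for (_, prev_value), (ts, value) in zip(samples, samples[1:]):
        if value > prev_value:
            last_change_ts = ts
    return now - last_change_ts > max_idle_seconds


# The counter is still reported every hour but stopped increasing at t=3600.
flat = [(0, 10), (3600, 12), (7200, 12), (10800, 12)]
print(is_stalled(flat, now=10800, max_idle_seconds=3600))    # True

# The counter advanced on the most recent sample.
active = [(0, 10), (3600, 12), (7200, 15)]
print(is_stalled(active, now=7200, max_idle_seconds=3600))   # False
```

The point of this design is that the series never disappears: the counter keeps being reported even while the task is idle, so there is always data to evaluate.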

Greg Bray