How can I monitor stalled tasks?

Question

I am running a Rust app with Tokio in production. In the last version I had a bug, and some requests caused my code to go into an infinite loop.

While a task was stuck in the loop, all the other tasks continued to work and process requests. That lasted until the number of stalled tasks was high enough to make the program unresponsive.

My problem is that it took our monitoring systems a long time to notice that something had gone wrong. For example, the task that answers Kubernetes' health checks kept working, so I could not tell that there were stalled tasks in my system.

So my question is: is there a way to detect such cases and alert on them?

If I could find a way to define a timeout on a task, so that a task that does not return to the scheduler within X seconds/milliseconds is marked as stalled, that would be a good enough solution for me.

score 3 · Accepted Answer · answered Jan 13 '21 at 08:56


Using tracing might be an option here: following issue 2655, every Tokio task should have a span. Together with tracing-futures, this means you should get a tracing event every time a task is entered or suspended (see this example). By adding the relevant data (e.g. task id / request id / ...) you should then be able to feed this information to an analysis tool in order to know:

- that a task is blocked (was resumed then never suspended again)
- if you add your own spans, that a "userland" span was never exited / closed, which might mean it's stuck in a non-blocking loop (which is also an issue though somewhat less so)
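The analysis step described above (spot a task that was entered and never suspended again) could be sketched roughly as follows. This is a minimal, std-only illustration, not Tokio or `tracing` API: `StallTracker`, `on_enter`, `on_exit` and `stalled` are hypothetical names, and the sketch assumes that something (for example a custom subscriber) calls `on_enter` / `on_exit` with a numeric task id whenever the corresponding span is entered or exited.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of tokio or tracing): remembers when each
/// task was last entered (polled) and forgets it again once it yields.
#[derive(Default)]
pub struct StallTracker {
    entered_at: HashMap<u64, Instant>,
}

impl StallTracker {
    /// Call when the "span entered" event for `task_id` arrives.
    pub fn on_enter(&mut self, task_id: u64, now: Instant) {
        self.entered_at.insert(task_id, now);
    }

    /// Call when the "span exited" event arrives, i.e. the task yielded.
    pub fn on_exit(&mut self, task_id: u64) {
        self.entered_at.remove(&task_id);
    }

    /// Ids (sorted) of tasks that were entered more than `threshold` ago
    /// and have not exited since.
    pub fn stalled(&self, now: Instant, threshold: Duration) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .entered_at
            .iter()
            .filter(|(_, entered)| now.saturating_duration_since(**entered) > threshold)
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}
```

A watchdog running outside the async runtime (for instance a plain OS thread, so that it cannot itself be starved by stuck tasks) could call `stalled` periodically and raise an alert, or fail the health check, whenever the result is non-empty.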

I think that's about the extent of it: as noted by issue 2510, Tokio doesn't yet use the tracing information it generates and so provides no "built-in" introspection facilities.


Masklinn


Thanks for the answer, this sounds cool. But if I understand it correctly, this solution requires monitoring the log files created by tracing, and logging every event where a task is entered or suspended. Or could I create a task in my app that handles those events instead of using logs? – Eyal leshem Jan 13 '21 at 13:38

`tracing` is actually an *instrumentation* system. Although it has easy ways to be used as a logging system (if only for easy migration), you should be able to build your own subscriber to process those events however you wish. Take a look at `tracing_subscriber` and the OpenTelemetry or Gelf subscribers; `tracing-gelf` looks especially relevant, as it works by spawning a gelf Logger into a separate task. – Masklinn Jan 13 '21 at 14:21
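To illustrate the "process the events inside the app instead of writing logs" idea, here is a rough std-only sketch of the consumer side. It does not use the `tracing` crate: `SpanEvent` and `spawn_monitor` are hypothetical names, and the sketch assumes a custom subscriber would forward enter/exit notifications over the channel. A plain thread stands in for the separate task.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical event type standing in for what a custom subscriber
/// would forward when a span is entered or exited.
#[derive(Debug, Clone, Copy)]
pub enum SpanEvent {
    Enter(u64),
    Exit(u64),
}

/// Spawns a monitor thread that consumes events until every sender is
/// dropped, then returns the ids that were entered but never exited.
pub fn spawn_monitor() -> (Sender<SpanEvent>, JoinHandle<HashSet<u64>>) {
    let (tx, rx) = mpsc::channel::<SpanEvent>();
    let handle = thread::spawn(move || {
        let mut active = HashSet::new();
        // Iterating over the receiver blocks until the channel is closed.
        for event in rx {
            match event {
                SpanEvent::Enter(id) => {
                    active.insert(id);
                }
                SpanEvent::Exit(id) => {
                    active.remove(&id);
                }
            }
        }
        active
    });
    (tx, handle)
}
```

In a real service the monitor would inspect its state on a timer rather than only at shutdown; returning the set when the channel closes just keeps the sketch small and deterministic.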
