I am running a Rust app with Tokio in prod. In the last version i had a bug, and some requests caused my code to go into an infinite loop.
What happened is while the task that got into the loop was stuck, all the other task continue to work well and processing requests, that happened until the number of stalling tasks was high enough to cause my program to be unresponsive.
My problem is took a lot of time to our monitoring systems to identify that something go wrong. For example, the task that answer to Kubernetes' health check works well and I wasn't able to identify that I have stalled tasks in my system.
So my question is if there's a way to identify and alert in such cases?
If i could find way to define timeout on task, and if it's not return to the scheduler after X seconds/millis to mark the task as stalled, that will be a good enough solution for me.