When using Dask with SGE or PBS clusters, some of my workers occasionally become unresponsive.
These workers are highlighted in red in the dashboard Info section with their "Last seen" number constantly increasing.
I know this can happen if submitted tasks hold the GIL for too long, but that's not the case here. I'm talking about workers on which something went wrong (probably unrelated to Dask or the task itself).
They will not come back and are not detected as dead either.
The problem is that tasks submitted to these workers never finish and block everything else. (The workers become unresponsive after receiving a task, maybe while loading the environment.)
Is there a setting that would time out or invalidate a worker once it has been unresponsive for a given time?
If not, is it possible to do this invalidation manually and dispatch the remaining tasks to other workers, and what would be the recommended way to do so?
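
To make the manual route concrete, here is a minimal sketch of the detection half only. It assumes that the per-worker `last_seen` timestamp returned by `Client.scheduler_info()` is the same value the dashboard shows as "Last seen"; `find_stale_workers` and `max_silence` are hypothetical names of my own, not part of Dask.

```python
import time


def find_stale_workers(scheduler_info, max_silence=60.0, now=None):
    """Return sorted addresses of workers silent for more than max_silence seconds.

    scheduler_info is assumed to look like the dict from Client.scheduler_info():
    {"workers": {address: {"last_seen": <unix timestamp>, ...}, ...}, ...}
    """
    now = time.time() if now is None else now
    stale = []
    for address, worker in scheduler_info.get("workers", {}).items():
        last_seen = worker.get("last_seen")
        # Skip workers that report no timestamp rather than guessing.
        if last_seen is not None and now - last_seen > max_silence:
            stale.append(address)
    return sorted(stale)


# Intended use against a live cluster (not run here):
#   from distributed import Client
#   client = Client(cluster)
#   stale = find_stale_workers(client.scheduler_info(), max_silence=120.0)
```

What I am missing is the second half: what to safely do with the addresses in `stale` so that their tasks are rescheduled elsewhere. Depending on the `distributed` version, `scheduler_info()` may also need an extra argument to return every worker rather than a subset.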
Thanks in advance for any help regarding this issue.