I want to run some jobs on a cluster, and I want to be able to kill a job if it takes too long. Can I do this gracefully from the client and still have the worker available to do more jobs?
My scenario is that I want to investigate how different machine learning classifiers and hyperparameters affect the time it takes to run .fit(). If a fit takes too long, I just want to abandon the task and move on to the next one.
I can find the PIDs of the workers, and I can use kill() to send a signal from the client, but SIGINT, SIGHUP and SIGABRT all seem to ruthlessly kill the worker rather than just interrupt it. I can't put any logic in the worker code because it's the atomic call to .fit() that I want to time and interrupt.
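To make the behaviour concrete, here is a minimal sketch of the kind of thing I am doing, reduced to a single machine. It is not my real cluster code: the child Python process is a hypothetical stand-in for a worker, time.sleep() is a stand-in for the atomic .fit() call, and run_and_signal is a name made up for this example. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def run_and_signal(job_seconds=30.0, timeout=0.5, sig=signal.SIGABRT):
    """Start a stand-in worker and signal it if it outlives `timeout`.

    Returns the worker's exit code. On POSIX, a negative value -N means
    the process was terminated by signal N.
    """
    # Hypothetical stand-in for a worker blocked inside an atomic .fit() call.
    worker = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({job_seconds})"]
    )
    try:
        # The job finished within the time limit: nothing to interrupt.
        return worker.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The same thing I do from the client once I know the worker's PID.
        os.kill(worker.pid, sig)
        return worker.wait()


if __name__ == "__main__":
    print("worker exit code:", run_and_signal())
```

With SIGABRT the stand-in worker ends with a negative exit code, meaning the whole process is gone and not just the job it was running, which is exactly what I want to avoid.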