I'm running jobs on a Condor cluster, but some get stuck in an idle state and never seem to start, let alone finish. Short of manually running condor_wait -wait n logfile and then condor_rm, is there a more graceful (and automatic, built-in) way of terminating a hung job?
Alternatively, since these jobs are part of a DAGMan workflow, is there a way to time out a job in DAGMan so that the later jobs can run?