Condor Timeout for idle jobs

Question

I'm running jobs on a Condor cluster, but some get stuck in the idle state and never start, let alone finish. Short of manually running condor_wait -wait n logfile and then condor_rm, is there a more graceful (and automatic, built-in) way of terminating a hung job?

Alternatively, since these jobs are in a DAGMan workflow, is there a way to time out a job in DAGMan so that the later jobs can run?

score 4 · Answer 1 · answered Apr 23 '13 at 20:56

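Here are two ways to have a job removed automatically after it has been idle for too long (24 hours in this example). In both expressions, JobStatus == 1 means the job is idle, and EnteredCurrentStatus is the time at which it entered that state.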

Put the following in the submit file for the job:

periodic_remove = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24

Or put the following in the condor configuration file on the submit machine:

SYSTEM_PERIODIC_REMOVE = JobStatus == 1 && CurrentTime-EnteredCurrentStatus > 3600*24
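
For the submit-file option, here is a minimal sketch of where the line would sit in a complete submit description. The executable and file names (my_job.sh, my_job.log and so on) are placeholders I made up for illustration, not anything from the question, and the parentheses are only there for readability; the expression is otherwise the same as above.

```
# Hypothetical job: my_job.sh and the log/output/error names are placeholders.
executable = my_job.sh
log        = my_job.log
output     = my_job.out
error      = my_job.err

# Remove the job once it has been idle (JobStatus == 1) for more than 24 hours.
periodic_remove = (JobStatus == 1) && (CurrentTime - EnteredCurrentStatus > 3600*24)

queue
```

Since periodic_remove is set per submit file, this only affects jobs submitted with it, whereas SYSTEM_PERIODIC_REMOVE applies to every job in that submit machine's queue.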

Of course, it would be better to understand why the jobs are remaining in the idle state. To do that, you may find condor_q -analyze jobid helpful.
