I've been trying to get this running for the past couple of days, but I can't get it to work.
I have an HTCondor cluster with 5 nodes that my users often fill with jobs, some of which run for a very long time (e.g. days). When new jobs are submitted while these long-running jobs are active, the new jobs wait until the old jobs finish and only then start.
What I want to achieve is the following:
If new jobs come in while the cluster is saturated with long-running jobs, I want the long-running jobs to be suspended and the new jobs to start. The old long-running jobs should only continue after the new jobs have finished. Basically, I want to give new users a chance to run their jobs even when the cluster is saturated by long-running jobs. At the same time, the long-running jobs should continue once the new jobs have finished rather than being kicked out completely. Ideally, their progress should be kept, so I think SUSPEND is the correct option here (instead of PREEMPT or VACATE)?
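For comparison, here is a sketch of the eviction side of a startd policy, which is what PREEMPT and WANT_VACATE control, as I understand the HTCondor manual. The values are illustrative only, not a recommendation. The relevant difference is that both eviction paths send the job back to the queue, whereas suspension leaves the job's process paused on the slot.

```
# Eviction: when PREEMPT becomes true, the job is removed from the slot.
PREEMPT = False
# On eviction, True lets the job vacate gracefully (soft kill signal first);
# False kills it immediately. In both cases the job goes back to the queue
# and, unless it writes its own checkpoints, later restarts from the beginning.
WANT_VACATE = True
```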
I tried the following config, but I don't think I used the variables correctly. This is what I wanted to achieve:
IF new_user submits a job AND cluster is saturated by jobs from old_user AND new_user_prio < old_user_prio AND old_job_runtime > 3 hours THEN suspend old_job and run new_job
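```
# True once the current job's TotalJobRunTime exceeds 10800 s (3 hours)
LONG_JOB = (TotalJobRunTime > 10800)
# True if the incoming user's priority value is lower (i.e. better) than the running user's
PRIORITIZE_INCOMING_USER = (SubmitterUserPrio < RemoteUserPrio)
WANT_SUSPEND = ($(LONG_JOB)) && ($(PRIORITIZE_INCOMING_USER))
SUSPEND = ($(LONG_JOB)) && ($(PRIORITIZE_INCOMING_USER))
```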
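For context, the place where HTCondor normally compares two users' priorities is the negotiator's PREEMPTION_REQUIREMENTS on the central manager. As far as I understand the manual, RemoteUserPrio (the user currently running on the slot) and SubmitterUserPrio (the user whose job is waiting) are filled in there during matchmaking, so they may not be available to the startd's SUSPEND expression at all. A sketch of the same condition in that location is below. Note that negotiator preemption evicts the running job instead of suspending it, so on its own this would not preserve the long job's progress.

```
# Central manager (negotiator) configuration, illustrative only.
# In HTCondor a numerically lower user priority is a better priority.
# (time() - EnteredCurrentState) is how long the slot has been in its
# current state, used here as a rough stand-in for "job has run > 3 hours".
PREEMPTION_REQUIREMENTS = ((time() - EnteredCurrentState) > 10800) && (SubmitterUserPrio < RemoteUserPrio)
```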
I think I also need a PERIODIC_RELEASE somewhere so that the old jobs continue once the new jobs finish, right?
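Regarding resumption: periodic_release (and SYSTEM_PERIODIC_RELEASE) acts on jobs in the Held state, while a job suspended by the startd is in the Suspended state and is resumed when the startd's CONTINUE expression becomes true. Since a startd only evaluates SUSPEND for the job it is already running and cannot see jobs waiting in the queue, one pattern is to give each node an extra slot whose occupancy triggers the suspension. The sketch below assumes a hypothetical single-core node split into two static slots and a hypothetical job attribute `LongJob` that users set themselves; it replaces the user-priority comparison with that attribute, leaving the negotiator's normal fair-share ordering to decide which waiting job gets slot 1.

```
# Hypothetical single-core node split into two slots:
# slot 1 runs ordinary/new jobs, slot 2 runs jobs marked as long.
NUM_CPUS = 2
# Advertise each slot's State in every slot's ad as slot1_State, slot2_State.
STARTD_SLOT_ATTRS = State

# LongJob is a hypothetical attribute users add with "+LongJob = True"
# in their submit files; jobs without it may only use slot 1.
START = (SlotID == 1 && TARGET.LongJob =!= True) || (SlotID == 2 && TARGET.LongJob =?= True)

# Only slot 2 ever suspends: when slot 1 is claimed and slot 2's job
# has run for more than 3 hours (10800 s).
WANT_SUSPEND = (SlotID == 2)
SUSPEND = (SlotID == 2) && (slot1_State =?= "Claimed") && (TotalJobRunTime > 10800)
# Resume the suspended job once slot 1 is no longer claimed.
CONTINUE = (slot1_State =!= "Claimed")
```

Two caveats with this sketch: a slot can stay "Claimed" for a while between jobs, so resuming may lag behind the new job finishing, and static slots split the node's memory between them by default. Extending it to multi-core nodes would need one such pairing per core, which is not shown here.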
Thanks in advance for any help!
TL;DR: I tried to make HTCondor suspend old jobs that have been running for more than 3 hours on my cluster whenever new jobs from a different user come in, provided the new user has a lower PRIO than the old user (i.e. has used the cluster less than the old user).