
EDIT: the question title and tags were adjusted after the discovery that the described behavior does not stem from SLURM but from the R package {drake}, which is used as a proxy to execute SLURM array jobs.

I've got the following situation:

  • A SLURM job array of n=70 with X CPU and Y memory per job
  • 120 tasks to be run
  • Each task requires the same CPU + memory but takes a different amount of time to finish (a rough stand-in for this workload is sketched after this list)
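For illustration, a stripped-down stand-in for this workload could look roughly like the following; the runtimes are made up and a working clustermq SLURM configuration is assumed (in the real setup the workers are submitted through {drake}, see the answer below):

library(clustermq)

# hypothetical stand-in: 120 tasks with identical CPU/memory requirements but
# varying runtimes, spread over 70 workers submitted as a SLURM job array
runtimes <- sample(30:600, 120, replace = TRUE)
Q(Sys.sleep, time = runtimes, n_jobs = 70)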

This leads to the following situation:

For tasks 71-120 (after tasks 1-70 have completed), I have 50 active workers and 20 idle workers. The idle workers have no more tasks to process and just wait for the active workers to finish.

Now over time more and more workers finish, and at some point I have 5 active workers and 65 idle ones. Let's assume that the last 5 tasks take quite some time to complete. During this time, the idle workers block resources on the cluster and constantly print the following to their respective log files:

2021-04-03 19:41:41.866282 | > WORKER_WAIT (0.000s wait)
2021-04-03 19:41:41.868709 | waiting 1.70s
2021-04-03 19:41:43.571948 | > WORKER_WAIT (0.000s wait)

[...]

Is there a way to shut down these idle workers and free their resources once there are no more tasks left to allocate to them? Currently they wait until all workers are done and only then release their resources.

pat-s
  • Aren't you conflating two things here (slurm array jobs and `clustermq` workers)? I'd run the following to test: (1) `sbatch --array=1-2 (sleep script with $SLURM_ARRAY_TASK_ID)` (2) `clustermq::Q(Sys.sleep, time=c(1,60), n_jobs=2)`. Does the 2nd job/worker remain on your system in both cases? – Michael Schubert Apr 04 '21 at 16:13
  • Could be! That works fine, i.e. the first worker shuts down, releases resources and does not wait until the second one finishes. Does this mean the issue is neither SLURM nor `clustermq`, but that workers are waiting because of the submission through {drake}? – pat-s Apr 04 '21 at 16:30
  • This sounds like there's a bug in `drake`'s way of determining if there is more work to be done. In any case, it does not look like slurm is the cause. – Michael Schubert Apr 04 '21 at 16:33
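Spelling out the second check from the comment above as a standalone snippet (again assuming clustermq is configured for SLURM):

library(clustermq)

# two workers, two tasks of very different length: as confirmed in the comments,
# the worker that finishes the 1 s task shuts down and frees its SLURM allocation
# while the 60 s task is still running
Q(Sys.sleep, time = c(1, 60), n_jobs = 2)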

1 Answer


Thanks to the comment from @Michael Schubert, I've found that this behavior occurs when using the R package {drake} and its dynamic branching feature (workers for static targets shut down just fine).

Here, a "target" can have dynamic "subtargets", which can be computed as separate SLURM array jobs. These subtargets are combined after all of them have been computed. Until this aggregation step happens, all workers remain in a "waiting" state in which they print the WORKER_WAIT status shown above.

Wild guess: this might not be avoidable due to the design of dynamic targets in {drake}: to aggregate the subtargets, all of them need to exist first, so individual subtargets must be kept in a temporary state until every subtarget is available.

The following {drake} R code, together with the make() call shown after it, can be used in conjunction with a SLURM cluster to reproduce the behavior described above:

library(drake)

plan <- drake_plan(
  list_time = c(30, 60),
  test_dynamic = target(
    Sys.sleep(time = list_time),
    dynamic = map(list_time)
  )
)
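To execute the plan on the cluster, something along these lines should work; the scheduler option and the template file name are assumptions about the local clustermq setup:

# assumes a working clustermq SLURM configuration, e.g. in .Rprofile:
# options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
make(plan, parallelism = "clustermq", jobs = 2)

With jobs = 2, the worker that finishes the 30 s subtarget keeps printing the WORKER_WAIT messages shown in the question until the 60 s subtarget has completed and the dynamic target is aggregated.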
pat-s
  • As you suspected, the dynamic branching cleanup step is part of the priority queue, which is preventing the shutdown of superfluous workers. This is a design issue that `targets` is prepared to fix but `drake` is not. Tracking in https://github.com/ropensci/targets/issues/398. – landau Apr 05 '21 at 13:06