2

I have a pending job and I want to resize it. I tried:

scontrol update job <jobid> NumNodes=128

It does not work.

Note: I can change the walltime with scontrol, but changing the number of nodes fails, even though this page suggests it should be possible: http://www.nersc.gov/users/computational-systems/cori/running-jobs/monitoring-jobs/

Baum mit Augen
Jason

2 Answers

4

Here is a solution I got from the NERSC help desk (credit to Woo-Sun Yang at LBNL):

$ scontrol update jobid=jobid numnodes=new_numnodes-new_numnodes

E.g. $ scontrol update jobid=12345 numnodes=10-10

The trick is to give numnodes as a min-max range with both values equal, as above. It works for both shrinking and expanding the node count.
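
A small wrapper can make the min-max format harder to get wrong. The sketch below is only an illustration: the function name `resize_cmd` is hypothetical (it is not part of Slurm), and it prints the command instead of running it, so you can inspect it before executing it on a real cluster.

```shell
#!/bin/sh
# resize_cmd is a hypothetical helper, not part of Slurm.
# It prints the scontrol command from this answer, with numnodes
# in the min-max form (both ends set to the same value).
resize_cmd() {
    jobid=$1
    nodes=$2
    # Reject empty or non-numeric arguments, and a node count of 0.
    case $jobid in ''|*[!0-9]*) echo "invalid job id: $jobid" >&2; return 1 ;; esac
    case $nodes in ''|*[!0-9]*|0) echo "invalid node count: $nodes" >&2; return 1 ;; esac
    printf 'scontrol update jobid=%s numnodes=%s-%s\n' "$jobid" "$nodes" "$nodes"
}

# Dry run: show the command for job 12345 resized to 10 nodes.
resize_cmd 12345 10
```

Printing rather than executing keeps the helper safe to try anywhere; once the output looks right, it can be pasted into a shell on the cluster.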

Jason
3

You can resize jobs in Slurm provided that the job is pending or running.

According to the FAQ, you can resize a job by following these steps (with examples):

Expand

  1. Assuming that j1 requests 4 nodes and is submitted with:

    $ salloc -N4 bash
    
  2. Submit a new job (j2) that requests the extra nodes for j1 (here 10, for a total of 14 nodes) and make it dependent on j1 (whose ID is in SLURM_JOBID):

    $ salloc -N10 --dependency=expand:$SLURM_JOBID
    
  3. Deallocate the nodes of j2 (inside j2's shell, SLURM_JOBID refers to j2):

    $ scontrol update jobid=$SLURM_JOBID NumNodes=0
    
  4. Terminate j2:

    $ exit
    
  5. Assign the previously released nodes to j1 (back in j1's shell, SLURM_JOBID refers to j1 again):

    $ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
    
  6. And update the environment variables of j1 by sourcing the script created by the previous command:

    $ . ./slurm_job_${SLURM_JOBID}_resize.sh
    

Now, j1 has 14 nodes.
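
One shell detail in the last step is easy to trip over: in `$SLURM_JOBID_resize`, the shell reads all of `SLURM_JOBID_resize` as the variable name, because underscores are valid in names. Writing `${SLURM_JOBID}_resize.sh` avoids that. The snippet below illustrates the difference with a made-up job ID (65542 is just an example value, and no Slurm command is run).

```shell
#!/bin/sh
# Make sure unset variables expand to empty (the default) for this demo.
set +u

# Example value only; inside a real allocation Slurm sets this variable.
SLURM_JOBID=65542

# Without braces the shell looks up a variable named SLURM_JOBID_resize,
# which is unset here, so it expands to nothing.
unbraced="slurm_job_$SLURM_JOBID_resize.sh"

# With braces the variable name ends at the closing brace.
braced="slurm_job_${SLURM_JOBID}_resize.sh"

echo "unbraced: $unbraced"
echo "braced:   $braced"
```

Only the braced form produces a file name that contains the job ID.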

Shrink

  1. Assuming that j1 has been submitted with:

    $ salloc -N4 bash
    
  2. Update j1 to the new size:

    $ scontrol update jobid=$SLURM_JOBID NumNodes=2
    $ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
    
  3. And update the environment variables of j1 by sourcing the script created by the previous commands:

    $ . ./slurm_job_${SLURM_JOBID}_resize.sh
    

Now, j1 has 2 nodes.
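
The resize script only helps if it runs in the current shell. Executing a script starts a child process whose variable changes vanish when it exits, whereas sourcing it with `. ./script` applies them to the shell you are in. The sketch below shows this with a stand-in script; the temporary file and the `DEMO_NNODES` variable are made up for illustration and are not what Slurm generates.

```shell
#!/bin/sh
# Stand-in for a generated resize script (contents are hypothetical).
demo_script=$(mktemp)
printf 'export DEMO_NNODES=2\n' > "$demo_script"

DEMO_NNODES=4

# Running the script in a child shell leaves the current value untouched.
sh "$demo_script"
after_exec=$DEMO_NNODES

# Sourcing it runs the commands in the current shell, so the value changes.
. "$demo_script"
after_source=$DEMO_NNODES

rm -f "$demo_script"
echo "after executing: $after_exec"
echo "after sourcing:  $after_source"
```

The same reasoning applies to the real resize script: its variable updates reach your session only when it is sourced.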

Bub Espinja