
I have a program that will dynamically release resources during job execution, using the command:

    scontrol update JobId=$SLURM_JOB_ID NodeList=${remaininghosts}
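For context, a minimal sketch of what such a release step might look like inside the batch script; how ${remaininghosts} is built below (dropping the last node of the allocation) is only an illustration, not my actual code:

    # Expand the current allocation into one hostname per line
    allhosts=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
    # Illustrative policy: keep every node except the last one
    remaininghosts=$(echo "$allhosts" | head -n -1 | paste -sd, -)
    # Hand the dropped node back to Slurm
    scontrol update JobId=$SLURM_JOB_ID NodeList=${remaininghosts}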

However, this sometimes results in very strange behavior where the job is re-queued. Below is the output of sacct:

    sacct -j 1448590

    JobID      NNodes  State      Start     End       NodeList
    1448590         4  RESIZING   20:47:28  01:04:22  [0812,0827],[0663-0664]
    1448590.0       4  COMPLETED  20:47:30  20:47:30  [0812,0827],[0663-0664]
    1448590.1       4  RESIZING   20:47:30  01:04:22  [0812,0827],[0663-0664]
    1448590         3  RESIZING   01:04:22  01:06:42  [0812,0827],0663
    1448590         2  RESIZING   01:06:42  1:12:42   0827,tnxt-0663
    1448590         4  COMPLETED  05:33:15  Unknown   0805-0807,0809]

The first lines show that everything works fine and nodes are getting released, but the last line shows a completely different set of nodes with an unknown end time. The Slurm logs show the job got requeued:

    requeue JobID=1448590 State=0x8000 NodeCnt=1 due to node failure.

I suspect this might happen because the head node is killed, but the Slurm documentation doesn't say anything about that.

Does anybody have an idea or suggestion?

Thanks

Okbas

1 Answer


In this post there was a discussion about resizing jobs.

In your particular case, for shrinking I would use:

  1. Assuming that j1 has been submitted with:

    $ salloc -N4 bash
    
  2. Update j1 to the new size:

    $ scontrol update jobid=$SLURM_JOBID NumNodes=2
    $ scontrol update jobid=$SLURM_JOBID NumNodes=ALL
    
  3. And update the environment variables of j1 (the script is created by the previous commands):

    $ ./slurm_job_${SLURM_JOBID}_resize.sh
    

Now, j1 has 2 nodes.
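
To double-check that the shrink took effect, one quick way (just an example, not from the original discussion) is:

    $ squeue -j $SLURM_JOBID -o "%D %N"
    $ scontrol show job $SLURM_JOBID | grep -E 'NumNodes|NodeList'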

In your example, your "remaininghosts" list, as you suspect, may exclude the head node, which Slurm needs in order to shrink the job. If you provide a quantity instead of a list, the resize should work.
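
If you still want to drive the shrink from your ${remaininghosts} variable, a sketch of turning the list into a quantity (assuming ${remaininghosts} is a hostlist expression that scontrol can expand) would be:

    # Count the remaining hosts and shrink by quantity,
    # letting Slurm pick which nodes to release
    numremaining=$(scontrol show hostnames "${remaininghosts}" | wc -l)
    scontrol update jobid=$SLURM_JOBID NumNodes=${numremaining}
    scontrol update jobid=$SLURM_JOBID NumNodes=ALL
    ./slurm_job_${SLURM_JOBID}_resize.sh

Note that when Slurm is given a quantity, it decides which nodes to release, so this only fits if the job does not depend on keeping specific nodes.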

Bub Espinja