SLURM releasing resources using scontrol update results in unknown endtime

Question

I have a program that will dynamically release resources during job execution, using the command:

scontrol update JobId=$SLURM_JOB_ID NodeList=${remaininghosts}

However, this sometimes results in very odd behavior: the job gets requeued. Below is the output of sacct:

sacct -j 1448590

JobID      NNodes  State      Start     End       NodeList
1448590    4       RESIZING   20:47:28  01:04:22  [0812,0827],[0663-0664]
1448590.0  4       COMPLETED  20:47:30  20:47:30  [0812,0827],[0663-0664]
1448590.1  4       RESIZING   20:47:30  01:04:22  [0812,0827],[0663-0664]
1448590    3       RESIZING   01:04:22  01:06:42  [0812,0827],0663
1448590    2       RESIZING   01:06:42  1:12:42   0827,tnxt-0663
1448590    4       COMPLETED  05:33:15  Unknown   0805-0807,0809]

The first lines show everything working as expected: nodes are released one at a time. The last line, however, shows a completely different set of nodes with an unknown end time. The Slurm logs show that the job got requeued:

requeue JobID=1448590 State=0x8000 NodeCnt=1 due to node failure.

I suspect this might happen because the head node is killed, but the Slurm documentation doesn't say anything about that.

Does anybody have an idea or suggestion?

Thanks

score 0 · Answer 1 · answered Dec 06 '18 at 06:46

In this post there was a discussion about resizing jobs.

In your particular case, for shrinking, I would use the following approach.

Assuming that the job (call it j1) has been submitted with:
```
$ salloc -N4 bash
```

Update j1 to the new size:

$ scontrol update jobid=$SLURM_JOBID NumNodes=2
$ scontrol update jobid=$SLURM_JOBID NumNodes=ALL

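Then update the environment variables of j1 by sourcing the script that the previous commands created. The braces keep the shell from reading "_resize" as part of the variable name, and sourcing (rather than executing) makes the new values take effect in the current shell:
```
$ . ./slurm_job_${SLURM_JOBID}_resize.sh
```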

Now, j1 has 2 nodes.
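
Put together, the shrink sequence might look like the sketch below. shrink_job is a hypothetical helper name, and the sketch assumes that scontrol writes the resize script into the current working directory. It uses SLURM_JOB_ID, as in the question. The braces around the variable name keep the shell from reading "_resize" as part of it, and the script is sourced with "." so that the updated variables take effect in the calling shell.

```shell
#!/usr/bin/env bash
# Sketch of the shrink sequence; shrink_job is a hypothetical helper name.

# shrink_job NEW_NODE_COUNT
# Shrinks the current job to NEW_NODE_COUNT nodes and reloads the
# Slurm environment variables in the calling shell.
shrink_job() {
    local count="$1"
    local script="slurm_job_${SLURM_JOB_ID}_resize.sh"

    # Give Slurm a node count rather than an explicit node list.
    scontrol update JobId="${SLURM_JOB_ID}" NumNodes="${count}" || return 1

    # Assumption: the resize script appears in the current directory.
    [ -f "${script}" ] || return 1

    # Source it (do not execute it) so the new values reach this shell.
    . "./${script}"
}
```

Calling shrink_job 2 from the job script would then leave the job with two nodes, with Slurm choosing which nodes to release.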

In your example, the "remaininghosts" list may, as you suspect, exclude the head node, which Slurm needs in order to shrink the job. If you provide a node count (NumNodes) instead of a node list, the resize should work.
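
If an explicit node list is still needed, one option is to make sure the node running the job script is always part of it. The sketch below is only an illustration under that assumption: ensure_head_node is a hypothetical helper, and it handles plain comma-separated host names only, so a bracketed range would first have to be expanded (for example with scontrol show hostnames).

```shell
#!/usr/bin/env bash
# Sketch: keep the node running the job script in the NodeList passed to
# "scontrol update". The function name is hypothetical.

# ensure_head_node HEAD "host1,host2,..."
# Prints the comma-separated host list, prepending HEAD if it is missing.
ensure_head_node() {
    local head="$1" hosts="$2"
    case ",${hosts}," in
        # HEAD is already present as a whole entry: print the list unchanged.
        *",${head},"*) printf '%s\n' "${hosts}" ;;
        # HEAD is missing: put it first (and skip the comma if the list is empty).
        *)             printf '%s\n' "${head}${hosts:+,${hosts}}" ;;
    esac
}

# Inside a job, the call might look like this (not executed here):
#   head="${SLURMD_NODENAME:-$(hostname -s)}"
#   keep="$(ensure_head_node "$head" "$remaininghosts")"
#   scontrol update JobId="$SLURM_JOB_ID" NodeList="$keep"
```

The surrounding commas in the case pattern make the match exact per entry, so a head node named n1 is not mistaken for n11.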
