I have a program that will dynamically release resources during job execution, using the command:
scontrol update JobId=$SLURM_JOB_ID NodeList=${remaininghosts}
However, this results in some very weird behavior sometimes. Where the job is re-queued. Below is the output of sacct
sacct -j 1448590
JobID NNodes State Start End NodeList
1448590 4 RESIZING 20:47:28 01:04:22 [0812,0827],[0663-0664]
1448590.0 4 COMPLETED 20:47:30 20:47:30 [0812,0827],[0663-0664]
1448590.1 4 RESIZING 20:47:30 01:04:22 [0812,0827],[0663-0664]
1448590 3 RESIZING 01:04:22 01:06:42 [0812,0827],0663
1448590 2 RESIZING 01:06:42 1:12:42 0827,tnxt-0663
1448590 4 COMPLETED 05:33:15 Unknown 0805-0807,0809]
The first lines show everything works fine, nodes are getting released but in the last line, it shows a completely different set of nodes with an unknown end time. The slurm logs show the job got requeued:
requeue JobID=1448590 State=0x8000 NodeCnt=1 due to node failure.
I suspect this might happen because the head node is killed, but the slurm documentation doesn't say anything about that.
Does anybody had an idea or suggestion?
Thanks