I'm having trouble getting my head around the way jobs are launched by SLURM from an sbatch script. It seems like SLURM is ignoring the --ntasks argument and launching all the srun tasks in my batch file immediately. Here is an example, using a slight modification of the code from this answer on StackOverflow:
$ salloc --ntasks=1 --ntasks-per-core=1
salloc: Granted job allocation 1172
$ srun -n 1 sleep 10 & time srun -n 1 echo ok
[1] 5023
srun: cluster configuration lacks support for cpu binding
srun: cluster configuration lacks support for cpu binding
ok
real 0m0.052s
user 0m0.004s
sys 0m0.012s
So on my setup the srun echo command is being run immediately, whereas I would expect it to run after the srun sleep 10 command finishes.
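In sbatch form, the same test looks roughly like this (a minimal sketch; the trailing wait just keeps the batch script alive until the backgrounded step finishes):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --ntasks-per-core=1

# Both steps start immediately, even though the allocation only
# has one task slot -- this is the behaviour I don't understand.
srun -n 1 sleep 10 &
srun -n 1 echo ok
wait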
I am using SLURM 2.6.5 to schedule and submit jobs on my personal workstation with 8 cores. I installed it myself, so it's entirely possible the configuration is borked. Here are some relevant parts from the slurm.conf file:
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# COMPUTE NODES
NodeName=Tom NodeAddr=localhost CPUs=7 RealMemory=28100 State=UNKNOWN
PartitionName=Tom Nodes=Tom Default=YES MaxTime=INFINITE State=UP
Here is the output from printenv | grep SLURM after running salloc --ntasks=1:
SLURM_NODELIST=Tom
SLURM_NODE_ALIASES=(null)
SLURM_MEM_PER_CPU=4100
SLURM_NNODES=1
SLURM_JOBID=1185
SLURM_NTASKS=1
SLURM_TASKS_PER_NODE=1
SLURM_JOB_ID=1185
SLURM_SUBMIT_DIR=/home/tom/
SLURM_NPROCS=1
SLURM_JOB_NODELIST=Tom
SLURM_JOB_CPUS_PER_NODE=1
SLURM_SUBMIT_HOST=Tom
SLURM_JOB_NUM_NODES=1
I'd appreciate any comments or suggestions. Please let me know if any more info is required.
Thanks for reading,
Tom
Update after playing around some more
I have made some progress, but I'm still not quite getting the behaviour I want. If I use --exclusive, I can get the echo step to wait for the sleep step:
salloc --ntasks=1
salloc: Granted job allocation 2387
srun -n 1 --exclusive sleep 10 & time srun -n 1 --exclusive echo ok
[1] 16602
ok
[1]+ Done srun -n 1 --exclusive sleep 10
real 0m10.094s
user 0m0.017s
sys 0m0.037s
and
salloc --ntasks=2
salloc: Granted job allocation 2388
srun -n 1 --exclusive sleep 10 & time srun -n 1 --exclusive echo ok
[1] 16683
ok
real 0m0.067s
user 0m0.005s
sys 0m0.020s
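For what it's worth, the working single-slot case translates into a batch script like this (again just a sketch; without the wait, the script would exit before the backgrounded steps finish):

#!/bin/bash
#SBATCH --ntasks=1

# With only one task slot in the allocation, --exclusive makes
# the second step queue behind the first instead of starting
# immediately.
srun -n 1 --exclusive sleep 10 &
srun -n 1 --exclusive echo ok &
wait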
But I still don't know how to do this properly if I'm running a multi-step job where each step needs several processors, e.g.
salloc --ntasks=6
salloc: Granted job allocation 2389
srun -n 2 --exclusive stress -c 2 &
srun -n 2 --exclusive stress -c 2 &
srun -n 2 --exclusive stress -c 2 &
will give me 12 stress processes, as will
salloc --ntasks=6
salloc: Granted job allocation 2390
srun -n 1 --exclusive stress -c 2 &
srun -n 1 --exclusive stress -c 2 &
srun -n 1 --exclusive stress -c 2 &
srun -n 1 --exclusive stress -c 2 &
srun -n 1 --exclusive stress -c 2 &
srun -n 1 --exclusive stress -c 2 &
So what should I do if I want my sbatch script to take 6 processors and start three steps at a time, each with 2 processors? Is it correct to use srun --exclusive -n 1 -c 2 stress -c 2?
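For concreteness, this is the sort of script I have in mind (a sketch; the stress -t 30 timeout is only there so the test ends on its own):

#!/bin/bash
#SBATCH --ntasks=6

# Goal: three 2-CPU steps running concurrently, filling the
# 6-task allocation. Is -n 1 -c 2 the right incantation here?
srun --exclusive -n 1 -c 2 stress -c 2 -t 30 &
srun --exclusive -n 1 -c 2 stress -c 2 -t 30 &
srun --exclusive -n 1 -c 2 stress -c 2 -t 30 &
wait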