I recently configured a Slurm queuing system for a server with one node and 72 CPUs. Here is the conf file:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=hoffmann
##ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
# ---- Tried this to get more than one job running per node, but it seemed to cause data transmission failures ----
#SelectType=select/cons_res
#SelectTypeParameters=CR_CPU_MEMORY
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=hoffmann
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/SlurmctldLogFile
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/SlurmdLogFile
#
#
# COMPUTE NODES
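# (If I read my own node line right, CPUs=72 works out to 2 sockets x 18 cores/socket x 2 threads/core; Sockets is left implicit)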
NodeName=hoffmann CPUs=72 CoresPerSocket=18 ThreadsPerCore=2 State=UNKNOWN
PartitionName=queuing Nodes=hoffmann Default=YES MaxTime=INFINITE State=UP
It is running fine, with the limitation that it allocates all CPUs to each job regardless of what I ask for, the consequence being that only one job can run at a time. Here is the batch script I am running:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/home/ubuntu/test.out
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=500:00
sleep 50
echo 'done'
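In case it helps, this is roughly how I submit the script and check what each job got (the filename test.sbatch is just for illustration; %C in the squeue format is one way to see the CPUs allocated to each job):
sbatch test.sbatch
sbatch test.sbatch
squeue -o "%.10i %.9P %.8j %.2t %.4C %R"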
When I launch two of those and look at sinfo -o "%all", I see the whole node is allocated. I guess I made a mistake in my conf file. Any idea what it could be? Thanks