Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster counting over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: it is written in C and uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.

Resources and Tutorials:

Name Spelling

As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
12
votes
1 answer

SLURM: Changing the maximum number of simultaneously running tasks for a running array job

I have set off an array job as follows: sbatch --array=1:100%5 ... which will limit the number of simultaneously running tasks to 5. The job is now running, and I would like to change this number to 10 (i.e. I wish I'd run sbatch --array=1:100%10…
James Owers
  • 7,948
  • 10
  • 55
  • 71
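For readers landing on this question, a minimal sketch of one possible approach: `scontrol update` accepts an `ArrayTaskThrottle` field for job arrays. The helper name `throttle_command` and the job ID `12345` below are hypothetical, and the sketch only builds the command rather than running it.

```python
from typing import List


def throttle_command(job_id: str, max_running: int) -> List[str]:
    """Build an scontrol call that changes the throttle of a submitted array job.

    job_id is a placeholder for the real array job ID; a value of 0 is
    documented by scontrol as removing the limit.
    """
    if max_running < 0:
        raise ValueError("max_running must be non-negative")
    return [
        "scontrol",
        "update",
        f"JobId={job_id}",
        f"ArrayTaskThrottle={max_running}",
    ]


if __name__ == "__main__":
    # Print the command instead of executing it, so the sketch is safe to run
    # on a machine without Slurm installed.
    print(" ".join(throttle_command("12345", 10)))
```

On a cluster, the resulting list could be passed to `subprocess.run(..., check=True)`.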
11
votes
2 answers

Solving SLURM "sbatch: error: Batch job submission failed: Requested node configuration is not available" error

We have 4 GPU nodes with 2 36-core CPUs and 200 GB of RAM available at our local cluster. When I'm trying to submit a job with the following configuration: #SBATCH --nodes=1 #SBATCH --ntasks=40 #SBATCH --cpus-per-task=1 #SBATCH…
11
votes
1 answer

SLURM job history: get full length JobName

I want to get information about my job history of SLURM jobs. I use something like sacct --starttime 2014-07-01 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist to get a summary of my jobs,…
Yoda
  • 574
  • 1
  • 9
  • 21
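A hedged sketch related to this question: `sacct` lets a field name in `--format` be followed by `%N` to set its column width, which is one way to avoid truncated job names. The helper `sacct_format` and the width of 50 are illustrative assumptions, not part of the original question.

```python
from typing import Dict, Iterable


def sacct_format(fields: Iterable[str], widths: Dict[str, int]) -> str:
    """Join sacct field names, appending %WIDTH to those listed in widths."""
    parts = []
    for name in fields:
        if name in widths:
            parts.append(f"{name}%{widths[name]}")
        else:
            parts.append(name)
    return ",".join(parts)


if __name__ == "__main__":
    fmt = sacct_format(["User", "JobID", "JobName", "State"], {"JobName": 50})
    # The start time is the one used in the question above.
    print(f"sacct --starttime 2014-07-01 --format={fmt}")
```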
11
votes
1 answer

Installing/emulating SLURM on an Ubuntu 16.04 desktop: slurmd fails to start

Edit What I am really looking for is a way to emulate SLURM, something interactive and reasonably user-friendly that I can install. Original post I want to test drive some minimal examples with SLURM, and I am trying to install it all on a local…
landau
  • 5,636
  • 1
  • 22
  • 50
11
votes
3 answers

How to hold up a script until a slurm job (started with srun) is completely finished?

I am running a job array with SLURM, with the following job array script (that I run with sbatch job_array_script.sh [args]): #!/bin/bash #SBATCH ... other options ... #SBATCH --array=0-1000%200 srun ./job_slurm_script.py $1 $2 $3 $4 echo 'open'…
Marses
  • 1,464
  • 3
  • 23
  • 40
11
votes
3 answers

How to get the ID of the GPU allocated to a SLURM job on a multi-GPU node?

When I submit a SLURM job with the option --gres=gpu:1 to a node with two GPUs, how can I get the ID of the GPU which is allocated for the job? Is there an environment variable for this purpose? The GPUs I'm using are all nvidia GPUs. Thanks.
Negelis
  • 376
  • 4
  • 17
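A small sketch of one way to inspect this from inside a job, assuming the cluster's GPU GRES configuration exports `CUDA_VISIBLE_DEVICES` (common for NVIDIA GPUs, but site-dependent). The helper name `visible_gpus` is hypothetical. Where devices are constrained per job, the indices may be relative to the job rather than the physical IDs on the node.

```python
import os
from typing import List, Mapping, Optional


def visible_gpus(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES, if any.

    environ defaults to the process environment; passing a dict makes the
    function easy to try outside a Slurm job.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    print("GPUs visible to this process:", gpus if gpus else "none reported")
```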
11
votes
1 answer

Slurm server with an asterisk near the "idle"

I'm using Slurm. When I run sinfo -Nel it is common to see a server designated as idle, but sometimes there is also a little asterisk near it (like this: idle*). What does that mean? I couldn't find any info about that. (The server is up and…
ZoRo
  • 401
  • 2
  • 5
  • 12
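As a hedged illustration: the NODE STATE CODES section of the sinfo man page describes a trailing `*` as meaning the node is presently not responding. The sketch below splits such a suffix from the state name. The helper `split_state` is hypothetical, and the table covers only a few suffixes; the full set varies between Slurm versions.

```python
from typing import Optional, Tuple

# Partial table of sinfo state suffixes (an assumption-limited subset).
STATE_SUFFIXES = {
    "*": "node is not responding",
    "~": "node is in power-saving mode",
    "#": "node is powering up or being configured",
}


def split_state(state: str) -> Tuple[str, Optional[str]]:
    """Split a sinfo state such as 'idle*' into ('idle', meaning of suffix)."""
    if state and state[-1] in STATE_SUFFIXES:
        return state[:-1], STATE_SUFFIXES[state[-1]]
    return state, None


if __name__ == "__main__":
    for example in ("idle", "idle*", "idle~"):
        print(example, "->", split_state(example))
```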
10
votes
0 answers

Start cannot spawn child process: No such file or directory

Hi, I get this message when I run my job in Slurm. What does it mean? tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
rif3aa dev
  • 147
  • 10
10
votes
1 answer

GPU allocation in Slurm: --gres vs --gpus-per-task, and mpirun vs srun

There are two ways to allocate GPUs in Slurm: either the general --gres=gpu:N parameter, or the specific parameters like --gpus-per-task=N. There are also two ways to launch MPI tasks in a batch script: either using srun, or using the usual mpirun…
Jakub Klinkovský
  • 1,248
  • 1
  • 12
  • 33
10
votes
1 answer

Create directory for log file before calling slurm sbatch

Slurm sbatch directs stdout and stderr to the files specified by the -o and -e flags, but fails to do so if the filepath contains directories that don't exist. Is there some way to automatically make the directories for my log files? Manually…
Empiromancer
  • 3,778
  • 1
  • 22
  • 53
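One possible workaround, sketched under assumptions: create the parent directories of the log paths in a small wrapper before calling sbatch. The helper name `prepare_sbatch` and the example paths are hypothetical, and the sketch assumes the directory part of each path contains no Slurm filename patterns such as `%j`.

```python
import os
from typing import List


def prepare_sbatch(script: str, stdout_path: str, stderr_path: str) -> List[str]:
    """Create the log directories, then return the sbatch command to run."""
    for path in (stdout_path, stderr_path):
        directory = os.path.dirname(path)
        if directory:
            # exist_ok=True makes repeated submissions harmless.
            os.makedirs(directory, exist_ok=True)
    return ["sbatch", "-o", stdout_path, "-e", stderr_path, script]


if __name__ == "__main__":
    cmd = prepare_sbatch("job.sh", "logs/run1/out.txt", "logs/run1/err.txt")
    # Printed rather than executed so the sketch runs without Slurm.
    print(" ".join(cmd))
```

The returned list could then be handed to `subprocess.run(cmd, check=True)` on a machine where sbatch is available.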
10
votes
4 answers

How to configure the content of slurm notification emails?

Slurm can notify the user by email when certain types of events occur using options such as --mail-type and --mail-user. The emails I receive this way contain an empty body and a title that looks like: SLURM Job_id=9228 Name=toto Ended, Run time…
Johann Bzh
  • 834
  • 3
  • 10
  • 25
10
votes
2 answers

Python wait Slurm job?

I have a Python script that should generate a bunch of inputs for an external program to be called. The calls to the external program will be through Slurm. What I want is for my script to wait until all the generated calls to the external programs…
Anon
  • 215
  • 2
  • 6
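A minimal sketch of one option, assuming each external call is wrapped in its own batch script: `sbatch --wait` does not return until the submitted job terminates, so starting one such process per script and then waiting on all of them blocks until every job is done. The function name `submit_all_and_wait` and the script names are hypothetical; the `popen` parameter exists only so the logic can be exercised without Slurm.

```python
import subprocess
from typing import Callable, List, Sequence


def submit_all_and_wait(
    scripts: Sequence[str],
    popen: Callable = subprocess.Popen,
) -> List[int]:
    """Submit every script with 'sbatch --wait' and block until all finish.

    Returns the exit code of each sbatch process, in submission order.
    """
    # Start all submissions first so the jobs can be queued concurrently.
    processes = [popen(["sbatch", "--wait", script]) for script in scripts]
    # Then wait for each sbatch process, which ends when its job ends.
    return [process.wait() for process in processes]


# Example use on a cluster (not executed here):
#     codes = submit_all_and_wait(["case_001.sh", "case_002.sh"])
#     failed = [c for c in codes if c != 0]
```

For a large number of jobs, a job array or a dependency such as `--dependency=afterok:...` may scale better than one waiting process per job.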
10
votes
1 answer

After submitting a .m batch job with Slurm, can I edit my .m file without changing my original submission?

Say I want to run a job on the cluster: job1.m Slurm handles the batch jobs and I'm loading Mathematica to save the output file job1.csv I submit job1.m and it is sitting in the queue. Now, I edit job1.m to have different variables and parameters,…
Alyssa
  • 101
  • 1
  • 3
10
votes
1 answer

seq uses comma as decimal separator

I have noticed a strange seq behavior on one of my computers (Ubuntu LTS 14.04): instead of using points as decimal separator it is using commas: seq 0. 0.1 0.2 0,0 0,1 0,2 The same version of seq (8.21) on my other PC gives the normal points (also…
Miguel
  • 7,497
  • 2
  • 27
  • 46
9
votes
1 answer

kubernetes with slurm, is this correct setup?

I saw that some people run Kubernetes alongside Slurm, and I was curious why you would need Kubernetes with Slurm. What is the main difference between Kubernetes and Slurm?
zidni
  • 91
  • 1
  • 1
  • 2