Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
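
For readers new to Slurm, the three functions map directly onto everyday commands. A minimal sketch of a batch submission, assuming illustrative resource values and script contents:

    #!/bin/bash
    #SBATCH --job-name=demo      # name shown in the queue
    #SBATCH --ntasks=4           # function 1: request an allocation of 4 tasks
    #SBATCH --time=00:10:00      # hold the resources for at most 10 minutes

    # function 2: srun starts and monitors the work on the allocated nodes
    srun hostname

Submitted with "sbatch demo.sh"; while the cluster is busy the job simply waits in the pending queue (visible via "squeue"), which is function 3: arbitrating contention.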

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C, it uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters that expand dynamically onto a cloud virtual machine (VM) provider to accommodate workload bursts (a minimal configuration sketch follows this list).
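
Below is a minimal sketch of the power-saving setup mentioned in the last bullet. The parameter names are documented slurm.conf options; the program paths and timing values are illustrative assumptions, not recommendations:

    # slurm.conf excerpt (illustrative values; the site scripts are hypothetical)
    SuspendProgram=/usr/local/sbin/node_poweroff.sh
    ResumeProgram=/usr/local/sbin/node_poweron.sh
    SuspendTime=600         # seconds a node must sit idle before suspension
    SuspendTimeout=30       # seconds allowed for a node to power down
    ResumeTimeout=300       # seconds allowed for a node to boot back up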

Name Spelling

As of v18.08, the spelling "SLURM" was changed to "Slurm" (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama universe, where it is popular and highly addictive.

1738 questions
35
votes
4 answers

How do I save print statements when running a program in SLURM?

I am running a Python program that contains print statements via SLURM. Normally when I run the Python code directly via "python program.py" the print statements appear in the terminal. When I run my program via SLURM, as expected the print…
Ian
  • 515
  • 1
  • 4
  • 10
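
A common resolution for this, sketched with illustrative file names: sbatch already redirects stdout to a file, and Python's output buffering is what hides the prints until the job ends.

    #!/bin/bash
    #SBATCH --output=program_%j.out   # stdout lands here; %j expands to the job ID

    # -u disables Python's output buffering so print statements appear promptly
    srun python -u program.py
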
34
votes
3 answers

Error in SLURM cluster - Detected 1 oom-kill event(s): how to improve running jobs

I'm working on a SLURM cluster, running several processes at the same time (on several input files) using the same bash script. At the end of the job, the process was killed and this is the error I obtained. slurmstepd: error: Detected…
CafféSospeso
  • 1,101
  • 3
  • 11
  • 28
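
oom-kill events generally mean the job exceeded its memory allocation. A hedged sketch of the usual workflow, with an illustrative job ID and size:

    # inspect how much memory the failed job actually used vs. requested
    sacct -j 123456 --format=JobID,MaxRSS,ReqMem,State

    # then resubmit with a larger request, e.g. in the batch script:
    #SBATCH --mem-per-cpu=4G
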
32
votes
1 answer

comment in bash script processed by slurm

I am using slurm on a cluster to run jobs and submit a script that looks like below with sbatch: #!/usr/bin/env bash #SBATCH -o slurm.sh.out #SBATCH -p defq #SBATCH --mail-type=ALL #SBATCH --mail-user=my.email@something.com echo "hello" Can I…
user1981275
  • 13,002
  • 8
  • 72
  • 101
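
Background for this question's theme: #SBATCH lines are ordinary shell comments that sbatch parses only until the first non-comment command, and a doubled # disables a directive. A minimal sketch built from the question's own script:

    #!/usr/bin/env bash
    #SBATCH -o slurm.sh.out    # parsed by sbatch
    ##SBATCH -p defq           # extra '#': sbatch ignores this directive
    echo "hello"               # first real command; no #SBATCH below is parsed
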
21
votes
3 answers

How to run code in a debugging session from VS code on a remote using an interactive session?

I am using a cluster (similar to slurm but using condor) and I wanted to run my code using VS code (specifically its debugger) and its remote sync extension. I tried running it using my debugger in VS code but it didn't quite work as expected. First…
Charlie Parker
  • 5,884
  • 57
  • 198
  • 323
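
The question's cluster runs HTCondor, but the Slurm equivalent of the interactive session it needs is usually obtained as below (resource values and script name are illustrative):

    # request an interactive shell on a compute node
    srun --ntasks=1 --cpus-per-task=4 --time=01:00:00 --pty bash

    # then, inside that shell, run the program under a debugger
    python -m pdb my_script.py
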
21
votes
2 answers

SLURM sacct shows 'batch' and 'extern' job names

I have submitted a job to a SLURM queue, the job has run and completed. I then check the completed jobs using the sacct command. But looking at the results of the sacct command I notice additional results that I did not expect: JobID …
Parsa
  • 3,054
  • 3
  • 19
  • 35
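
Brief context: "batch" and "extern" are job steps that Slurm accounts separately from the allocation itself, and sacct can suppress them. A sketch with an illustrative job ID:

    # .batch is the batch script step; .extern covers processes outside
    # Slurm's direct launch control (e.g. an ssh session on the node)
    sacct -j 123456 --format=JobID,JobName,State

    # -X (--allocations) reports only the allocation, hiding the steps
    sacct -X -j 123456
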
21
votes
3 answers

How to find from where a job is submitted in SLURM?

I submitted several jobs via SLURM to our school's HPC cluster. Because the shell scripts all have the same name, the job names appear exactly the same. It looks like [myUserName@rclogin06 ~]$ sacct -u myUserName JobID JobName …
Sibbs Gambling
  • 19,274
  • 42
  • 103
  • 174
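
A hedged sketch of the usual approaches (job ID illustrative): scontrol reports the submit directory for jobs the controller still knows about, and recent sacct versions expose it as a field.

    # for pending/running jobs, WorkDir= and Command= show the origin
    scontrol show job 123456

    # for completed jobs, on Slurm versions whose sacct has the WorkDir field
    sacct -u myUserName --format=JobID,JobName,WorkDir%40
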
20
votes
1 answer

SLURM nodes, tasks, cores, and cpus

Would someone be able to clarify what each of these things actually are? From what I gathered, nodes are computing points within the cluster, essentially a single computer. Tasks are processes that can be executed either on a single node or on…
Eoin S
  • 325
  • 1
  • 2
  • 6
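
A sketch of how the four terms map onto sbatch directives (values and binary name are illustrative):

    #!/bin/bash
    #SBATCH --nodes=2            # nodes: physical machines in the allocation
    #SBATCH --ntasks=8           # tasks: processes to launch (e.g. MPI ranks)
    #SBATCH --cpus-per-task=4    # cpus: cores given to each individual task

    srun ./my_mpi_program        # srun distributes the 8 tasks over the 2 nodes
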
20
votes
2 answers

Sbatch: pass job name as input argument

I have the following script to submit job with slurm: #!/bin/sh #!/bin/bash #SBATCH -J $3 #job_name #SBATCH -n 1 #Number of processors #SBATCH -p CA nwchem $1 > $2 The first argument ($1) is my input, the second ($2) is my output and I would…
Laetis
  • 1,337
  • 3
  • 16
  • 28
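
The underlying issue: #SBATCH lines are comments, so $3 is never expanded there. The usual fix, sketched with the question's own arguments, is to pass the name on the command line, where options override in-script directives:

    # command-line options take precedence over #SBATCH directives
    sbatch --job-name="$3" myscript.sh "$1" "$2"
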
20
votes
2 answers

Running TensorFlow on a Slurm Cluster?

I could get access to a computing cluster, specifically one node with two 12-Core CPUs, which is running with Slurm Workload Manager. I would like to run TensorFlow on that system but unfortunately I was not able to find any information about how…
daniel451
  • 10,626
  • 19
  • 67
  • 125
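
A minimal sketch for a single-node setup like the one described, with 24 cores on the node and a hypothetical training script:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=24   # let TensorFlow's thread pools use every core

    srun python train.py
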
19
votes
3 answers

Python - Log memory usage

Is there a way in python 3 to log the memory (RAM) usage, while some program is running? Some background info. I run simulations on an HPC cluster using slurm, where I have to reserve some memory before submitting a job. I know that my job requires a…
physicsGuy
  • 3,437
  • 3
  • 27
  • 35
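
Slurm records this itself; a hedged sketch with an illustrative job ID: sstat samples a running job's steps, and sacct reports the peak after completion.

    # while the job runs: peak/average resident set size of the batch step
    sstat -j 123456.batch --format=JobID,MaxRSS,AveRSS

    # after completion: memory actually used vs. the amount reserved
    sacct -j 123456 --format=JobID,MaxRSS,ReqMem,Elapsed
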
19
votes
1 answer

Changing the bash script sent to sbatch in slurm during run a bad idea?

I wanted to run a python script main.py multiple times with different arguments through a sbatch_run.sh script as in: #!/bin/bash #SBATCH --job-name=sbatch_run #SBATCH --array=1-1000 #SBATCH --exclude=node047 arg1=10 #arg to be change during…
Charlie Parker
  • 5,884
  • 57
  • 198
  • 323
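
Context, hedged: sbatch copies the batch script at submission time, so later edits do not affect jobs already queued, while anything the script reads at run time is evaluated when the job starts. Passing values through the environment avoids editing the script at all:

    # export a per-submission value instead of editing the script between runs
    sbatch --export=ALL,ARG1=10 sbatch_run.sh

    # inside sbatch_run.sh it is an ordinary environment variable:
    echo "running with ARG1=$ARG1"
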
19
votes
2 answers

SLURM: How to run 30 jobs on particular nodes only?

You need to run, say, 30 srun jobs, but ensure each of the jobs is run on a node from the particular list of nodes (that have the same performance, to fairly compare timings). How would you do it? What I tried: srun --nodelist=machineN[0-3]…
Ayrat
  • 1,221
  • 1
  • 18
  • 36
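
One hedged approach: --nodelist requests all of the listed nodes at once, so to pin each job to a single node from the pool, submit per node (node names follow the question; the benchmark binary is hypothetical):

    # give each of the 30 jobs exactly one node from the fixed four-node pool
    for i in $(seq 1 30); do
        node="machineN$(( (i - 1) % 4 ))"   # cycle machineN0..machineN3
        srun --nodes=1 --nodelist="$node" ./benchmark "$i" &
    done
    wait
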
19
votes
2 answers

Use Bash variable within SLURM sbatch script

I'm trying to obtain a value from another file and use this within a SLURM submission script. However, I get an error that the value is non-numerical, in other words, it is not being dereferenced. Here is the script: #!/bin/bash # This reads out the…
Madeleine P. Vincent
  • 3,361
  • 5
  • 25
  • 30
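
The root cause, sketched: #SBATCH lines are comments read by sbatch before the shell ever runs, so variables in them are never dereferenced. Passing the value on the command line sidesteps this (the source file name is hypothetical):

    # read the value in the submitting shell, then hand it to sbatch directly
    mem_value=$(cat mem_requirement.txt)
    sbatch --mem="${mem_value}" job_script.sh
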
18
votes
4 answers

Is it possible to configure the directory for sbatch's default output file?

Is there some way to configure an alternative default directory (other than the current directory) for sbatch to put the file slurm-%j.out (or slurm-%A_%a.out) that it generates when the -o is not specified? My goals here are to have a…
kjo
  • 33,683
  • 52
  • 148
  • 265
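
To my knowledge there is no slurm.conf setting for a default output directory, so the usual workaround is an explicit -o path combined with the standard filename patterns (the directory is illustrative):

    # per-job and per-array-task output files under a fixed directory
    sbatch -o /scratch/logs/slurm-%j.out    myscript.sh
    sbatch -o /scratch/logs/slurm-%A_%a.out array_script.sh
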
18
votes
1 answer

Slurm: What is the difference for code executing under salloc vs srun

I'm using a cluster managed by slurm to run some yarn/hadoop benchmarks. To do this I am starting the hadoop servers on nodes allocated by slurm and then running the benchmarks on them. I realize that this is not the intended way to run a production…
Daniel Goodman
  • 273
  • 1
  • 2
  • 7
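
The short answer, hedged: salloc only obtains the allocation and runs its command (by default a shell) on the submit host, while srun launches the command on the allocated nodes. A sketch that makes the difference visible:

    # salloc: 'hostname' runs locally; only the inner srun reaches the nodes
    salloc --nodes=2 bash -c 'hostname; srun hostname'

    # srun alone: the command itself runs on the two allocated nodes
    srun --nodes=2 hostname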