Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
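As a concrete illustration of those three functions, here is a minimal sketch of a batch script and how it would be submitted and monitored (the job name, limits, and paths are illustrative, not from this page):

    #!/bin/bash
    #SBATCH --job-name=demo          # name shown in the queue
    #SBATCH --ntasks=1               # request one task (one core)
    #SBATCH --time=00:05:00         # time limit for the allocation
    #SBATCH --output=demo-%j.out     # stdout file; %j expands to the job ID

    srun hostname                    # launch a job step on the allocated node

Submitting it with sbatch demo.sh places it in the queue (where contention is arbitrated), and squeue shows it while pending or running.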

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.
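For a sense of how small the simplest configuration can be, a minimal slurm.conf for a homogeneous cluster might look roughly like the sketch below; the hostnames and counts are illustrative, and a production file needs site-specific paths, accounting, and scheduling settings:

    # slurm.conf - minimal illustrative sketch
    ClusterName=demo
    SlurmctldHost=head01                 # node running the slurmctld controller
    AuthType=auth/munge                  # the usual authentication plugin
    NodeName=node[01-04] CPUs=8 State=UNKNOWN
    PartitionName=debug Nodes=node[01-04] Default=YES MaxTime=INFINITE State=UP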

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster with over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C, it uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts (a configuration sketch follows this list).
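The power-control behavior in the last bullet is governed by a handful of slurm.conf parameters; a sketch, assuming site-provided suspend/resume scripts (the paths and values are hypothetical):

    # Power saving in slurm.conf: idle nodes are suspended after SuspendTime
    SuspendTime=600                               # seconds idle before power-down
    SuspendProgram=/usr/local/sbin/node_suspend   # hypothetical site script
    ResumeProgram=/usr/local/sbin/node_resume     # powers nodes back up on demand
    ResumeTimeout=300                             # seconds allowed for a node to boot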


Name Spelling

As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
17 votes · 3 answers

How to get original location of script used for SLURM job?

I'm starting a SLURM job with a script, and the script must behave according to its own location, which it obtains with SCRIPT_LOCATION=$(realpath $0). But SLURM copies the script to the slurmd folder and starts the job from there, which breaks this…
Araneus0390 · 556 rep · 1 gold · 5 silver · 18 bronze
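A commonly cited workaround (not from the question itself): ask Slurm for the command as submitted instead of relying on $0. scontrol reports it in the Command= field, and SLURM_SUBMIT_DIR holds the directory sbatch was invoked from:

    # Inside the script: recover the path it was submitted from.
    if [ -n "$SLURM_JOB_ID" ]; then
        # Command= holds the submitted script path (plus any arguments)
        SCRIPT_LOCATION=$(scontrol show job "$SLURM_JOB_ID" | awk -F= '/Command=/{print $2}')
    else
        SCRIPT_LOCATION=$(realpath "$0")    # normal invocation outside Slurm
    fi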
17 votes · 6 answers

Limit the number of running jobs in SLURM

I am queuing multiple jobs in SLURM. Can I limit the number of jobs running in parallel in slurm? Thanks in advance!
Philipp H. · 1,513 rep · 3 gold · 17 silver · 31 bronze
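For job arrays, Slurm documents a throttle: a % suffix on --array caps how many of the array's tasks run simultaneously (the sizes below are illustrative):

    #SBATCH --array=1-100%10    # 100 tasks in total, at most 10 running at once

The cap can also be adjusted on a live array with scontrol update JobId=<jobid> ArrayTaskThrottle=20.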
17 votes · 2 answers

How to set the maximum priority to a Slurm job?

As an administrator, I need to give the maximum priority to a given job. I have found that submission options like --priority= or --nice[=adjustment] could be useful, but I do not know which values I should assign them in order to provide the…
Bub Espinja · 4,029 rep · 2 gold · 29 silver · 46 bronze
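One known mechanism (job ID and value below are illustrative): an administrator can set the priority field directly with scontrol; the field is an unsigned integer and larger values run sooner. sbatch --priority also exists but is restricted to operators and administrators:

    $ scontrol update JobId=12345 Priority=100000   # raise a pending job's priority
    $ scontrol top 12345                            # or move it ahead of the user's other jobs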
16 votes · 2 answers

How to activate a specific Python environment as part of my submission to Slurm?

I want to run a script on a cluster (SBATCH file). How can I activate my virtual environment (path/to/env_name/bin/activate)? Do I only need to add the following code to my my_script.sh file? module load python/2.7.14 source…
bib · 944 rep · 3 gold · 15 silver · 32 bronze
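A sketch of the usual pattern, reusing the module and path named in the question (the batch directives are illustrative): the batch script runs as an ordinary shell script, so the environment is activated inside it before calling python:

    #!/bin/bash
    #SBATCH --job-name=pyjob
    module load python/2.7.14              # if the site provides environment modules
    source path/to/env_name/bin/activate   # activate the virtualenv in the job's shell
    python my_script.py                    # runs with the environment's interpreter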
16 votes · 2 answers

How can I get detailed job run info from SLURM (e.g. like that produced for "standard output" by LSF)?

When using bsub with LSF, the -o option gave a lot of details such as when the job started and ended and how much memory and CPU time the job took. With SLURM, all I get is the same standard output that I'd get from running a script without LSF. For…
Christopher Bottoms · 11,218 rep · 8 gold · 50 silver · 99 bronze
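Slurm keeps this information in its accounting records rather than in the output file; sacct prints it per job, and many sites also install the contributed seff summary script (job ID illustrative):

    $ sacct -j 12345 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,State
    $ seff 12345    # human-readable CPU/memory efficiency summary, if installed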
15 votes · 3 answers

Is it possible to run SLURM jobs in the background using SRUN instead of SBATCH?

I was trying to run slurm jobs with srun in the background. Unfortunately, right now, because I have to run things through docker, it's a bit annoying to use sbatch, so I am trying to find out if I can avoid it altogether. From my…
Charlie Parker · 5,884 rep · 57 gold · 198 silver · 323 bronze
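srun blocks until its step finishes, but it is an ordinary command and can be backgrounded; nohup keeps it alive after logout (the program name and limits are illustrative):

    $ nohup srun -n1 --time=01:00:00 ./my_program > run.log 2>&1 &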
15 votes · 3 answers

SLURM sbatch job array for the same script but with different input arguments run in parallel

I have a problem where I need to launch the same script but with different input arguments. Say I have a script myscript.py -p -i , where I need to consider N different par_values (between x0 and x1) and M trials for each value…
maurizio · 745 rep · 1 gold · 7 silver · 25 bronze
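A common pattern for this (values illustrative) is a job array whose index, SLURM_ARRAY_TASK_ID, selects the arguments for each task:

    #!/bin/bash
    #SBATCH --array=0-5                  # one array task per (par_value, trial) pair
    PARAMS=(0.1 0.2 0.3 0.1 0.2 0.3)     # hypothetical parameter table
    python myscript.py -p "${PARAMS[$SLURM_ARRAY_TASK_ID]}" -i "$SLURM_ARRAY_TASK_ID"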
15 votes · 1 answer

How to change how frequently SLURM updates the output file (stdout)?

I am using SLURM to dispatch jobs on a supercomputer. I have set the --output=log.out option to place the content from a job's stdout into a file (log.out). I'm finding that the file is updated every 30-60 minutes, making it difficult for me to…
Neal Kruis · 2,055 rep · 3 gold · 26 silver · 49 bronze
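The delay usually comes from buffering in the application rather than from a Slurm setting; forcing line-buffered or unbuffered output typically makes the file update continuously (program names illustrative):

    $ stdbuf -oL -eL ./my_program      # line-buffer stdout/stderr (GNU coreutils)
    $ python -u my_script.py           # unbuffered output from Python
    $ srun --unbuffered ./my_program   # srun's own option to flush output promptly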
14 votes · 4 answers

In Slurm, is there a quick command to determine the total number of jobs (pending and active) at a given moment?

In slurm, calling the command squeue -u will list all the jobs that are pending or active for a given user. I am wondering if there is a quick way to tally them all so that I know how many outstanding jobs there are, including pending…
user321627 · 2,350 rep · 4 gold · 20 silver · 43 bronze
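Counting squeue's output lines is the usual quick tally; -h suppresses the header so the count is exact, and -r expands job arrays to one line per task:

    $ squeue -u "$USER" -h | wc -l                        # all of the user's jobs
    $ squeue -u "$USER" -h -r -t pending,running | wc -l  # per array task, filtered by state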
14 votes · 0 answers

Get stdout/stderr from a slurm job at runtime

I have a batch file to send a job with sbatch. The contents of the batch file are # Setting the proper SBATCH variables ... #SBATCH --error="test_slurm-%j.err" #SBATCH --output="test_slurm-%j.out" ... WORKDIR=. echo "Run…
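The files named by --output and --error are written while the job runs, so one approach is to locate them via scontrol and follow them with tail (job ID illustrative):

    $ scontrol show job 12345 | grep -E 'StdOut|StdErr'   # paths of the output files
    $ tail -f test_slurm-12345.out                        # follow stdout as it is produced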
14 votes · 1 answer

Running slurm script with multiple nodes, launch job steps with 1 task

I am trying to launch a large number of job steps using a batch script. The different steps can be completely different programs and do need exactly one CPU each. First I tried doing this using the --multi-prog argument to srun. Unfortunately, when…
Nils_M · 1,062 rep · 10 silver · 24 bronze
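A frequently cited pattern for this (program names illustrative): give every step its own one-task srun, backgrounded, and wait for them all; --exclusive keeps steps from sharing CPUs on older Slurm versions (newer releases use --exact for this):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=16                       # total CPUs available to the steps
    for prog in ./prog_*; do
        srun -N1 -n1 --exclusive "$prog" &    # one single-CPU step per program
    done
    wait                                      # keep the allocation until all steps end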
14 votes · 3 answers

SLURM display the stdout and stderr of an unfinished job

I used to use a server with LSF, but now I just transitioned to one with SLURM. What is the equivalent command of bpeek (for LSF) in SLURM? bpeek displays the stdout and stderr output of an unfinished job. I couldn't find the documentation…
Dnaiel · 7,622 rep · 23 gold · 67 silver · 126 bronze
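There is no direct bpeek equivalent, but two common substitutes (job ID illustrative): tail the file named by --output, or use sattach for steps launched with srun:

    $ tail -f slurm-12345.out   # default output file name is slurm-%j.out
    $ sattach 12345.0           # attach to the stdout/stderr of step 0 of a running job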
13 votes · 1 answer

Submit and monitor SLURM jobs using Apache Airflow

I am using the Slurm job scheduler to run my jobs on a cluster. What is the most efficient way to submit the Slurm jobs and check on their status using Apache Airflow? I was able to use a SSHOperator to submit my jobs remotely and check on their…
stardust · 177 rep · 2 silver · 9 bronze
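Whatever operator wraps the SSH calls, the Slurm side usually reduces to a submit that yields a job ID plus a polling command (job ID illustrative); these are the pieces an Airflow task would run remotely:

    $ sbatch --parsable job.sh    # prints only the job ID, easy to capture
    $ squeue -j 12345 -h -o %T    # current state; empty once the job leaves the queue
    $ sacct -j 12345 -n -o State  # final state from accounting after completion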
13 votes · 3 answers

Using python's multiprocessing on slurm

I am trying to run some parallel code on slurm, where the different processes do not need to communicate. Naively I used python's slurm package. However, it seems that I am only using the CPUs on one node. For example, if I have 4 nodes with 5…
physicsGuy · 3,437 rep · 3 gold · 27 silver · 35 bronze
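python's multiprocessing cannot span nodes, so a common workaround (counts illustrative) is one task per node, each running its own multiprocessing pool over that node's CPUs:

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1   # one python process per node
    #SBATCH --cpus-per-task=5     # each process may fork a 5-worker pool
    srun python my_script.py      # srun starts one copy of the script on every node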
13 votes · 1 answer

Running a binary without a top level script in SLURM

In SGE/PBS, I can submit binary executables to the cluster just like I would locally. For example: qsub -b y -cwd echo hello would submit a job named echo, which writes the word "hello" to its output file. How can I submit a similar job to SLURM?…
highBandWidth · 16,751 rep · 20 gold · 84 silver · 131 bronze
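sbatch's documented --wrap option covers this case: it generates the wrapper script itself, so a bare command line can be submitted directly:

    $ sbatch --job-name=echo --wrap="echo hello"   # submits 'echo hello' with no script file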