Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
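In practice, these three functions are exercised through a few commands: sbatch queues a batch script, srun launches tasks on the allocation, and squeue inspects the pending-work queue. A minimal sketch of a batch script (job name and resource values are illustrative, not from any particular system):

```shell
# Write a minimal Slurm batch script. The #SBATCH lines are directives
# read by sbatch; the shell treats them as comments.
cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
srun hostname        # launch one task per allocated slot
EOF
# queue it with: sbatch hello.sbatch
```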

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in heterogeneous clusters containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C and configured with the GNU autoconf engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.
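As a concrete example of the last point, node power saving is driven by a handful of slurm.conf parameters; a sketch (script paths and thresholds are hypothetical):

```ini
# slurm.conf fragment (sketch): power down nodes idle for 10 minutes
SuspendTime=600                                   # seconds idle before power-down
SuspendProgram=/usr/local/sbin/node_suspend.sh    # hypothetical site script
ResumeProgram=/usr/local/sbin/node_resume.sh      # hypothetical site script
ResumeTimeout=300                                 # seconds to wait for a node to boot
```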

Name Spelling

As of v18.08, the spelling of the name was changed from “SLURM” to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
0 votes, 1 answer

SLURM environmental variables are empty

I tried to submit a job via the command line using --wrap instead of submitting through a submission script, and for some reason none of the SLURM_* variables are initialized: sbatch --job-name NVP --time 01:00:00 --nodes 1 --ntasks 1…
MirrG
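A common cause of empty SLURM_* variables with --wrap is quoting: double quotes let the submitting shell expand the variables (where they are unset) before Slurm ever sees them. A sketch of the difference, runnable without a cluster:

```shell
# SLURM_* variables are only set inside the job, so expansion must be
# deferred until then. Compare, in a shell where they are unset:
wrapped_bad="echo job $SLURM_JOB_ID"   # expands now -> "echo job "
wrapped_ok='echo job $SLURM_JOB_ID'    # stays literal until the job runs
echo "$wrapped_bad"
echo "$wrapped_ok"
# submit the working form with: sbatch --wrap "$wrapped_ok"
```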
0 votes, 2 answers

Snakemake trigger automatic job re-submission on slurm cluster

I have a question for a very specific use case. I'll start by giving a bit of background: I am trying to train a deep learning model in keras and want to do 10 fold cross validation to check training stability of the model. Usually I create…
user8707594
0 votes, 1 answer

syntax error: operand expected (error token is "))")

I am trying to align my samples to a reference genome using bwa mem. I have over 300 samples, for which I created an index from a metadata file, but something isn't really working. The loop I'm using is this (in SLURM): #SBATCH --export=ALL # export all…
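That particular bash error usually means an empty variable was expanded inside an arithmetic expression. A sketch of the failure mode and two guards (the variable name is illustrative):

```shell
# With n unset, "$(( $n * 2 ))" collapses to "$(( * 2 ))" and bash
# reports "syntax error: operand expected". Referencing the name
# without "$", or supplying a default, avoids the empty expansion:
unset n
i=$(( n * 2 ))          # unset names evaluate as 0 inside (( ))
j=$(( ${n:-0} + 1 ))    # an explicit default also works
echo "$i $j"            # -> 0 1
```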
0 votes, 0 answers

How to isolate processes? Searching for single-node alternative to slurm

I'd like to provide a process with limited number of cores and limited memory. With slurm, I'd solve this with a command like the following: srun --pty -c32 --mem=178G bash How would I do this without slurm on a normal desktop computer?
Hoeze
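Without Slurm, a rough single-node stand-in for `srun -c32 --mem=178G` is CPU affinity plus a memory ulimit; a sketch (core and memory values are illustrative; taskset comes from util-linux):

```shell
#!/usr/bin/env bash
# Run a command pinned to a CPU set with a virtual-memory cap.
# For enforced cgroup limits on a systemd machine, an alternative is:
#   systemd-run --user --scope -p AllowedCPUs=0-31 -p MemoryMax=178G <cmd>
run_confined() {
  local cpus="$1" mem_kb="$2"; shift 2
  # taskset pins affinity; ulimit -v caps the child's address space (kB)
  taskset -c "$cpus" bash -c 'ulimit -v '"$mem_kb"'; exec "$@"' _ "$@"
}
run_confined 0 1048576 echo "confined run"
```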
0 votes, 1 answer

Does mpirun know if the requested number of cores is bigger or smaller than the available cores in a node?

I am considering which process launcher, between mpirun and srun, is better at optimizing the resources. Let's say one compute node in a cluster has 16 cores in total and I have a job I want to run using 10 processes. If I launch it using mpirun…
nougako
0 votes, 1 answer

Running Stata jobs on Slurm but errors: stata: command not found

Here is my sbatch file: #!/bin/bash #SBATCH --output=stata_example.out #SBATCH --error=stata_example.err #SBATCH --nodes=1 module load stata stata do stata_example.do But it always returns with error: stata: command not found. I have tried with…
0 votes, 0 answers

SLURM sbatch multiple parent jobs in parallel, each with multiple child jobs

I want to run a Fortran code called orbits_01 on SLURM. I want to run multiple jobs simultaneously (i.e. parallelize over multiple cores). After running multiple jobs, each orbits_01 program will call another executable called optimizer, and the…
Shaun Han
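One common pattern for this parent-then-child structure is submitting the child with a dependency on the parent; a sketch (the script names echo the question but are hypothetical):

```shell
# Write a small driver that queues the optimizer only after its
# orbits_01 parent job finishes successfully (afterok).
cat > chain.sh <<'EOF'
#!/bin/bash
parent=$(sbatch --parsable run_orbits.sbatch)   # --parsable prints the bare job id
sbatch --dependency=afterok:"$parent" run_optimizer.sbatch
EOF
# run this driver once per parent/child pair to submit them all in parallel
```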
0 votes, 0 answers

Python code takes longer to run with MPI (SLURM) than as a single process

I have some python code which takes approximately 12 hours to run on my laptop (MacOS 16GB 2133 MHz LPDDR3). The code is looping over a few thousand iterations and doing some intensive processing at each step so it makes sense to parallelise the…
0 votes, 1 answer

slurmd.service is Failed & there is no PID file /var/run/slurmd.pid

I am trying to start slurmd.service using the commands below, but it does not stay up. I will be grateful if you could help me resolve this issue! systemctl start slurmd scontrol update nodename=fwb-lab-tesla1 state=idle This is the…
Charlt
0 votes, 1 answer

How to set the USIF specific variable "GLIBC" in modulefile?

I am working in a slurm-based HPC cluster, and I have done so for the past 5 years. We load and unload the modules we need for our analyses, among which the compilers such as gcc. This has worked seamlessly for me until two days ago. For the last…
schmat_90
0 votes, 1 answer

sbatch: error: Batch job submission failed: Requested node configuration is not available

The problem is not related to the number of CPUs assigned to the job. Before this problem, I had an error with the Nvidia driver configuration such that I couldn't detect the GPUs with 'nvidia-smi'; after solving that error by running…
Charlt
0 votes, 1 answer

SLURM job script to rsync files

Is there a way to submit a SLURM script to transfer files? I use rsync on the command line, but I don't know how to do something similar with a SLURM script. #!/bin/bash #SBATCH --job-name=transfer # Job name #SBATCH --mail-type=END,FAIL …
blu potatos
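For reference, a data-transfer job of this shape can be an ordinary batch script whose payload is rsync; a sketch (paths and resource values are hypothetical):

```shell
# Write a minimal transfer job. rsync flags: -a archive mode
# (preserve permissions/times), -v verbose; the trailing slash on the
# source copies its contents rather than the directory itself.
cat > transfer.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=transfer
#SBATCH --ntasks=1
#SBATCH --time=02:00:00
rsync -av /scratch/$USER/data/ /project/$USER/backup/
EOF
# submit with: sbatch transfer.sbatch
```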
0 votes, 1 answer

Does slurm require the same version across all nodes?

I'm experimenting with a cluster setup where different nodes run either version 19 or 20 of Slurm. The managing node has Slurm 20. For some reason, nodes with Slurm 19 can't ping the manager (scontrol ping returns Slurmctld(primary) on node0 is…
Araneus0390
0 votes, 2 answers

SLURM: Restart worker after the worker completes

I'd like to create an array of SLURM workers, and whenever one of those workers finishes its work, I'd like to restart the worker. If it were possible to run jobs of infinite duration on my queue, I'd of course do that instead, but because this…
duhaime
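One sketch of a self-restarting worker is a job script whose last step resubmits the script itself (the payload name is hypothetical; SLURM_SUBMIT_DIR is the directory sbatch was invoked from):

```shell
# Write a worker job that requeues a fresh copy of itself on completion.
cat > worker.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=worker
#SBATCH --time=12:00:00
./do_work.sh                               # hypothetical payload
# resubmit so a fresh worker replaces this one when it finishes
sbatch "$SLURM_SUBMIT_DIR/worker.sbatch"
EOF
# start the first worker with: sbatch worker.sbatch
```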
0 votes, 1 answer

Usage of flock for copying files

I am running array jobs on slurm, so every job needs to copy a file from a local directory to a temporary one. This cp should not occur simultaneously. This is the code I came up with: mydirectory=mydb LOCKFILE_1=${mydirectory}.lock set -e ( …
Saraha
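The flock(1) idiom the question is reaching for can be packaged as a small function: each array task blocks on the lock file until it is the only one copying. A runnable sketch (directory names are illustrative, not from the question):

```shell
#!/usr/bin/env bash
# Serialise a copy across concurrent array tasks with flock(1).
copy_locked() {
  local src="$1" dst="$2" lock="$1.lock"
  (
    flock -x 9            # take an exclusive lock on file descriptor 9
    cp -r "$src" "$dst"   # critical section: only one cp runs at a time
  ) 9>"$lock"
}

# demo fixture so the sketch runs outside a cluster
mkdir -p /tmp/demo_db && echo payload > /tmp/demo_db/f
rm -rf /tmp/demo_db_copy
copy_locked /tmp/demo_db /tmp/demo_db_copy
# inside a real array task the destination would use $SLURM_ARRAY_TASK_ID
```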