Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
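In practice, users reach these functions through commands such as sbatch, srun, and squeue. A minimal batch script might look like the sketch below; the partition name and resource counts are assumptions to adapt to the local cluster:

    #!/bin/bash
    #SBATCH --job-name=hello          # name shown by squeue
    #SBATCH --partition=debug         # assumed partition name
    #SBATCH --nodes=1
    #SBATCH --ntasks=4                # four tasks in the allocation
    #SBATCH --time=00:10:00           # wall-clock limit

    srun hostname                     # run one copy of hostname per task

It would be submitted with "sbatch hello.sh" and monitored with "squeue -u $USER".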

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.
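As a rough illustration of the accounting side, the controller is pointed at the slurmdbd daemon (which in turn talks to MySQL/MariaDB) in slurm.conf, and limits are then managed with sacctmgr; the host name, account, user, and limit values below are assumptions:

    # slurm.conf
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbhost        # assumed host running slurmdbd
    JobAcctGatherType=jobacct_gather/linux

    # shell: create an account and cap a user's concurrent jobs
    sacctmgr add account physics
    sacctmgr add user alice account=physics
    sacctmgr modify user alice set MaxJobs=50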

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: it is written in C and uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts (see the slurm.conf sketch after this list).
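The power-saving behaviour is driven by a handful of slurm.conf parameters; the script paths and timings below are assumptions, not defaults:

    SuspendProgram=/usr/local/sbin/node_suspend.sh   # called to power a node down
    ResumeProgram=/usr/local/sbin/node_resume.sh     # called to power it back up
    SuspendTime=600        # seconds a node must sit idle before suspension
    SuspendTimeout=60
    ResumeTimeout=300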

Name Spelling

As of v18.08, the spelling of the name was changed from “SLURM” to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
0 votes, 1 answer

Running a bash script on nodes srun uses for an mpi job

I can launch an mpi job across multiple compute nodes using a slurm batch script and srun. As part of the slurm script, I want to launch a shell script that runs on the nodes the job is using to collect information (using the top command) about the…
Ed Hall
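A common pattern for this is to start a one-task-per-node monitoring step in the background before the MPI step; monitor.sh is a hypothetical wrapper around top in batch mode, and --overlap is only needed (and only exists) on Slurm 20.11 or newer:

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=8

    # one monitor process per allocated node, running alongside the MPI job
    srun --ntasks=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 --overlap ./monitor.sh &

    srun ./mpi_app        # the MPI application itself

    kill %1               # stop the background monitoring step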
0 votes, 0 answers

Activate Conda Environment From Script

I am using a server that runs on Slurm. It requires the use of scripts to launch jobs. In particular, I have to use the following command: sbatch script.sh. Inside this script, I have to specify a couple of things, including the conda environment I'm…
Alfred
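Because a batch script runs in a non-interactive shell, conda usually has to be initialised explicitly before the environment can be activated. A minimal sketch, assuming a Miniconda install under $HOME and an environment called myenv:

    #!/bin/bash
    #SBATCH --job-name=conda-job
    #SBATCH --ntasks=1

    # make the `conda` shell function available, then activate the environment
    source "$HOME/miniconda3/etc/profile.d/conda.sh"
    conda activate myenv

    python my_script.py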
0 votes, 1 answer

Slurm interactive mode - run pre-specified command at beginning

On my cluster, I can get a shell for interactive mode if I run: srun -N 1 --ntasks-per-node=1 --gres=gpu:1 --pty zsh. However, on this cluster, each node that is allocated has an empty $HOME directory (without the .zshrc), which is stored on a…
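One workaround (a sketch, assuming the real startup files live under a shared path such as /shared/home/$USER) is to point zsh at them via ZDOTDIR before launching the step; srun exports the environment to the remote shell by default:

    export ZDOTDIR=/shared/home/$USER     # assumed location of the real .zshrc
    srun -N 1 --ntasks-per-node=1 --gres=gpu:1 --pty zsh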
0 votes, 3 answers

Cannot install slurm seff in debian 9 (stretch)

In a cluster of Debian 9 machines, I have installed Slurm via apt-get, but I see that the seff command is not available. How could I install it? I see that there is a contribs folder in the tar.gz file, but no instructions are given on how seff (and…
potant
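seff is a Perl script shipped in the contribs/seff directory of the Slurm source tarball rather than in the Debian packages. A rough, unverified sketch of one way to install it (package names and paths should be checked against the stretch repositories and against the locally installed Slurm version):

    apt-get install libslurmdb-perl                    # Perl bindings seff relies on
    # copy the script from a source tree matching `sinfo --version`
    cp slurm-*/contribs/seff/seff /usr/local/bin/seff
    chmod +x /usr/local/bin/seff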
0 votes, 0 answers

How to have variable number of nodes for different mpi executions in a script file in SLURM?

I would like to have 4 different MPI executions of the same program, each with a different number of nodes. The output files should be named n_out.txt depending on the node count. I have tried the following .sh file: #!/bin/bash #SBATCH --partition=halley #SBATCH…
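One way to do this is to request the largest node count once and then launch several job steps that each use a subset of the allocation; the partition name is taken from the question, everything else is an assumption:

    #!/bin/bash
    #SBATCH --partition=halley
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1

    # one srun step per node count, writing n_out.txt for each n
    for n in 1 2 3 4; do
        srun --nodes=$n --ntasks=$n ./my_mpi_program > "${n}_out.txt"
    done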
0 votes, 1 answer

How to set maximum allowed CPUs per job in Slurm?

How can I set the maximum number of CPUs each job can ask for in Slurm? We're running a GPU cluster and want a sensible number of CPUs to be always available for GPU jobs. This is kind of fine as long as the job asks for GPUs because there's GPU <->…
Milad
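One approach (a sketch, assuming accounting is enabled and AccountingStorageEnforce includes limits,qos) is a QOS with a per-job CPU cap attached to the GPU partition; the QOS name, partition definition, and cap value are assumptions:

    # create the QOS and cap the CPUs any single job may request
    sacctmgr add qos gpujobs
    sacctmgr modify qos gpujobs set MaxTRESPerJob=cpu=16

    # slurm.conf: force every job in the partition through that QOS
    PartitionName=gpu Nodes=gpu[01-04] QOS=gpujobs State=UP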
0 votes, 1 answer

How does one implement the e-mailing option for slurm?

I am using slurm in a cluster and when I turn on the e-mailing option it does not work. Is there any special type of administering I need to do to turn it on in my cluster? My sample submission script for sbatch: #!/bin/sh #SBATCH…
Charlie Parker
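Job e-mail needs two things: --mail-* options in the submission script, and a working mail command on the machine running slurmctld (the MailProg parameter in slurm.conf, /bin/mail by default). The address below is a placeholder:

    #!/bin/sh
    #SBATCH --job-name=test
    #SBATCH --mail-type=BEGIN,END,FAIL      # events that trigger a message
    #SBATCH --mail-user=user@example.com    # placeholder address

If messages are still not delivered, the controller host typically needs a configured MTA, or MailProg pointed at a wrapper that can actually send mail.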
0 votes, 1 answer

Batch job submission failed: Requested node configuration is not available SLURM

I am trying to submit a bash job script with Slurm. The following is contained in my bash script: #SBATCH --partition=normal #SBATCH --nodes=1 #SBATCH --ntasks-per-node=8 #SBATCH --cpus-per-task=2 #SBATCH --gres=gpu:v100d32q:1 #SBATCH…
Perl Del Rey
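This error means no node in the partition matches everything the job asks for (CPU count, memory, GRES type, and so on). Comparing the request with what the nodes actually advertise usually narrows it down; these are standard query commands:

    scontrol show node | grep -E 'NodeName|CPUTot|Gres'
    sinfo -o '%P %N %c %G'     # partition, node list, CPUs, generic resources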
0 votes, 1 answer

How do I enable python submission scripts on my slurm cluster?

I have access to a cluster using slurm and want to extend it to use python for sbatch submission scripts. How do I do that? I tried giving my submission script different paths to the interpreter: #!/bin/python #!/usr/bin/python #!/usr/bin/env…
Charlie Parker
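sbatch itself accepts any interpreter named in the shebang, as long as the #SBATCH directives appear as comment lines near the top of the file (which they are in Python). A small sketch, written here as a shell heredoc so the whole example is self-contained:

    cat > submit.py <<'EOF'
    #!/usr/bin/env python3
    #SBATCH --job-name=pyjob
    #SBATCH --ntasks=1
    import os
    print("running on", os.environ.get("SLURMD_NODENAME"))
    EOF
    sbatch submit.py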
0 votes, 1 answer

How to write sbatch to handle multiple jobs in Slurm

I have two executables I need to run: a.out and b.out. (1) I want to run a.out on two nodes, with one a.out process per node. (2) I want to run b.out on the same two nodes as in (1), but with two b.out processes per node. My naive code…
Xu Hui
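One way to express this is a single allocation sized for the larger step, with srun controlling the per-node task count of each step; this is a sketch rather than the asker's actual script:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=2     # enough tasks for the larger step

    # (1) one a.out process on each of the two nodes
    srun --nodes=2 --ntasks=2 --ntasks-per-node=1 ./a.out
    # (2) two b.out processes on each of the same two nodes
    srun --nodes=2 --ntasks=4 --ntasks-per-node=2 ./b.out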
0 votes, 1 answer

How do I use hybrid OpenMP/OpenMPI parallelization together with GNU compilers?

I am running a physics solver that was written to use hybrid OpenMP/MPI parallelization. The job manager on our cluster is SLURM. Everything goes as expected when I am running in a pure MPI mode. However, once I try to use hybrid parallelization…
tre95
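The usual hybrid recipe is to let --ntasks-per-node define the MPI ranks and --cpus-per-task define the OpenMP threads per rank, then export OMP_NUM_THREADS from the Slurm variable; the counts and binary name below are assumptions:

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=4     # MPI ranks per node
    #SBATCH --cpus-per-task=6       # OpenMP threads per rank

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    srun ./solver                   # each rank spawns its threads on its reserved cores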
0 votes, 0 answers

Running a single Matlab parpool job across multiple nodes of a cluster using the Slurm scheduler

I am running a Matlab script called EmitCue on a remote cluster using the Slurm scheduler. To submit a job that will run in parallel across 24 CPU cores on a single node, I use a shell script such as: #!/bin/bash #SBATCH -A AccountName #SBATCH -J…
Jabby
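Worth noting: a 'local' parpool can only use the cores of a single node; spanning several nodes requires a MATLAB Parallel Server cluster profile instead. A single-node sketch that sizes the pool from the Slurm allocation (the script name comes from the question, everything else is assumed):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=24

    matlab -nodisplay -r "parpool('local', str2double(getenv('SLURM_CPUS_PER_TASK'))); EmitCue; exit"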
0 votes, 1 answer

Slurm sbatch for a PyTorch script draining node; gres/gpu: count changed for node node002 from 0 to 1

We have a user whose script always drains a node. Note this error: "gres/gpu: count changed for node node002 from 0 to 1" Could it be misleading? What could cause the node to drain? Here are the contents of the user's SBATCH file. Could the piping…
RobbieTheK
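That message usually points at a mismatch between what slurm.conf declares for the node and what the node's gres.conf reports, rather than at the user's script. A sketch of a consistent pair of definitions, followed by returning the node to service (the device path is an assumption):

    # slurm.conf
    GresTypes=gpu
    NodeName=node002 Gres=gpu:1 ...      # rest of the node definition unchanged

    # gres.conf on node002
    NodeName=node002 Name=gpu File=/dev/nvidia0

    # after restarting the daemons with matching configs
    scontrol update NodeName=node002 State=RESUME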
0 votes, 0 answers

Slurm: fatal: No front end nodes defined

I need to deploy a compute server on my Debian laptop. I only have my laptop, so I decided to run the controller and the node on the same computer, but systemctl status slurmctld.service gives me an error: ● slurmctld.service - Slurm controller…
user8788726
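Front end nodes are generally only relevant to special (e.g. Cray-style) builds; an ordinary single-machine setup does not define them at all. A minimal slurm.conf sketch where the controller and the only compute node share the hostname (the hostname and hardware figures are assumptions; very old releases use ControlMachine instead of SlurmctldHost):

    ClusterName=local
    SlurmctldHost=laptop
    NodeName=laptop CPUs=4 RealMemory=7900 State=UNKNOWN
    PartitionName=debug Nodes=laptop Default=YES MaxTime=INFINITE State=UP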
0 votes, 1 answer

SLURM: Should there be a different gres.conf for each node?

When configuring a slurm cluster you need to have a copy of the configuration file slurm.conf on all nodes. These copies are identical. In the situation where you need to use GPUs in your cluster you have an additional configuration file that you…
Durai Arasan
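gres.conf does not have to differ per node: lines can be prefixed with NodeName= so that a single identical file describes every node's devices (the node names and device paths below are assumptions):

    # shared gres.conf
    NodeName=node[01-02] Name=gpu File=/dev/nvidia[0-1]
    NodeName=node03      Name=gpu File=/dev/nvidia[0-3]

Alternatively, each node may carry its own local gres.conf describing only its own hardware; both layouts are supported.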