Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in a heterogeneous cluster containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C, it uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be straightforward porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.
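
As a concrete illustration of the allocate-then-execute model described above, a minimal batch script might look like the sketch below; the job name, partition, and limits are hypothetical and site-specific.

```shell
#!/bin/bash
# Minimal Slurm batch script: request resources, then run work on them.
#SBATCH --job-name=demo
#SBATCH --nodes=1            # one compute node
#SBATCH --ntasks=4           # four tasks (processes)
#SBATCH --time=00:10:00      # wall-clock limit
#SBATCH --partition=debug    # hypothetical partition name

# srun launches the tasks on the nodes Slurm allocated to this job.
srun hostname
```

Submitted with `sbatch job.sh`, the job waits in the queue until the requested resources are free, illustrating the queue-of-pending-work role described above.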

Name Spelling

As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
0
votes
1 answer

Parallelizing an Rscript using a job array in Slurm

I want to run an Rscript.R using an array job in Slurm, with 1-10 tasks, whereby the task id from the job will be directed to the Rscript, to write a file named "'task id'.out", containing 'task id' in its body. However, this has proven to be more…
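
A minimal sketch of the array-job pattern this question describes, with hypothetical file names:

```shell
#!/bin/bash
# Sketch: a ten-task job array driving an R script (names hypothetical).
#SBATCH --job-name=r-array
#SBATCH --array=1-10          # ten tasks, with ids 1..10

# Each task passes its own id to the R script as an argument; the script
# can read it with commandArgs(trailingOnly = TRUE) and write <id>.out.
Rscript Rscript.R "$SLURM_ARRAY_TASK_ID"
```

Slurm expands `$SLURM_ARRAY_TASK_ID` separately in each task, so the same script body produces ten differently named output files.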
0
votes
0 answers

Installing R package 'cubature': Error with compiler in remote server

I'm trying to install the R package 'cubature' on a Linux-based remote server that uses SLURM as its resource manager. As a user of the server, I do not have root access. I tried to install the R package locally, but I get the following error: In…
CafféSospeso
  • 1,101
  • 3
  • 11
  • 28
0
votes
2 answers

Correct usage of gpus-per-task for allocation of distinct GPUs via SLURM

I am using the cons_tres SLURM plugin, which introduces, among other things, the --gpus-per-task option. If my understanding is correct, the following script should allocate two distinct GPUs on the same node: #!/bin/bash #SBATCH --ntasks=2 #SBATCH…
redhotsnow
  • 85
  • 8
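
For reference, a sketch of the pattern this question aims at, assuming Slurm ≥ 19.05 with the cons_tres plugin; whether each task then sees only its own GPU also depends on the site's GPU binding configuration:

```shell
#!/bin/bash
# Sketch: two tasks on one node, one distinct GPU allocated per task
# (requires the cons_tres select plugin mentioned in the question).
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=1

srun nvidia-smi -L            # each task should report a distinct GPU
```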
0
votes
1 answer

ulimit stack size through slurm script

In my bash script I have the following command: ulimit -s unlimited. However, when I launch my job with sbatch job.sh and then ssh to one of the compute nodes to check the stack size with ulimit -a, I clearly see the stack size is: stack size …
ATK
  • 1,296
  • 10
  • 26
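
A common pitfall behind this question: an ssh session to a compute node gets the login shell's limits, not the job's. A sketch of checking the limit from inside the job itself (the `PropagateResourceLimits` setting in slurm.conf also influences what a job inherits from the submission shell):

```shell
#!/bin/bash
#SBATCH --ntasks=1

# Raise the limit inside the job script, then verify it from within the
# job: a separate ssh login to the node starts a new session whose
# limits come from the login configuration, not from this job.
ulimit -s unlimited
srun bash -c 'ulimit -s'
```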
0
votes
0 answers

Save a plot in R without png dev.off()

I am running an R script in a batch mode on my university Linux HPC cluster. I am using a module with pre-installed R packages, so I don't think I can install anything. R version is 4.0.3 I am trying to save the plots…
Yulia Kentieva
  • 641
  • 4
  • 13
0
votes
1 answer

How to write hostfile in Slurm script

Currently I am doing following #!/bin/bash -l #SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 scontrol show hostname $SLURM_JOB_NODELIST | perl -ne 'chomb; print "$_" x4' > myhostfile This generates the following myhostfile compute-0 …
ATK
  • 1,296
  • 10
  • 26
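
A sketch of one way to build such a hostfile: `scontrol show hostnames` expands the compressed nodelist one name per line, and each name is then repeated once per task on that node (4 here, matching `--ntasks-per-node=4`):

```shell
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# Expand the nodelist, then print each hostname four times.
scontrol show hostnames "$SLURM_JOB_NODELIST" |
  awk '{ for (i = 0; i < 4; i++) print }' > myhostfile
```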
0
votes
2 answers

How to prepare a code for execution on a cluster so that it takes one parameter from a .txt file at a time?

I am preparing some C++ code to be run on a cluster, managed by SLURM. The cluster takes one compiled file: a.out. It will then execute it on 500 different nodes via the JOB_ARRAY. When executing each copy of the file, it will need to read one input…
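
One common pattern for the setup this question describes: keep one parameter per line in a text file and let each array task pick its own line by index (file name hypothetical):

```shell
#!/bin/bash
# Sketch: 500 array tasks, each reading one line of params.txt and
# passing it to the compiled binary.
#SBATCH --array=1-500

# sed -n "Np" prints only line N of the file.
param=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
./a.out "$param"
```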
0
votes
1 answer

Set display resolution for Matlab with SLURM

I run some Matlab simulations on an HPC cluster. The cluster runs SLURM. One of the outputs of my Matlab script is a GIF file which shows the time evolution of what I am simulating. Every frame of the GIF file is obtained by means of the…
AndreaPaco
  • 101
  • 1
0
votes
1 answer

How do I call a Perl script in an SBATCH script for SLURM submissions?

I received a Perl script that apparently is called from an SBATCH script to be submitted as a job to a computer cluster managed by SLURM. The script is old and I have yet to become familiar with Perl. Additionally, the Perl script is being used…
0
votes
1 answer

Workload Manager for Windows based HPC with GPU

We have an HPC environment on a Windows server in AWS. We would like to share the computing capability with multiple users. I am not aware of any workload manager or scheduler for the Windows environment. I know about SLURM but it is not compatible…
Anup
  • 1
  • 1
0
votes
2 answers

Batch files created by Python not running, but editing in Notepad works

I intend to use Python to create a bunch of batch files, but the batch files created by Python cannot be uploaded, while the same code entered manually can. I was wondering why. My code is as follows: import…
alku
  • 3
  • 1
0
votes
1 answer

Select nodes with at most n CPUs

To submit jobs to a cluster through slurm, I can specify how many CPUs I want for a job with #SBATCH --ntasks-per-node={cpus}. However, this will send the job to any node with at least this many CPUs. This is normally fine, but say I'm on a cluster…
Tyberius
  • 625
  • 2
  • 12
  • 20
0
votes
1 answer

How to get multiple GPUs of the same type on Slurm?

How can I create a job with multiple GPUs of the same type, without specifying that type directly? My experiment has a constraint that all GPUs have the same type, but this type can be whatever we want. Currently I am only able to create an experiment with…
mvxxx
  • 188
  • 1
  • 7
0
votes
1 answer

SLURM srun print log instance-wise

While using slurm on multi-node cluster, I ran srun -N 2 -C worker nvidia-smi The output of this command is mangled/interleaved instead of in order. Example output: Tue Dec 15 22:37:55…
Chaitanya Bapat
  • 3,381
  • 6
  • 34
  • 59
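
The interleaving described above can at least be disambiguated with srun's `--label` (`-l`) option, which prefixes each output line with the id of the task that produced it. A sketch, reusing the command from the question:

```shell
# Prefix every output line with its task id so mixed output from the
# two nodes can be untangled afterwards.
srun -N 2 -l -C worker nvidia-smi
```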
0
votes
0 answers

Error installing Slurm: slurmd could not be started

I am trying to install Slurm on a small two-PC system, but I get the following error when starting slurmd: Job for slurmd.service failed because the control process exited with error code. See "systemctl status slurmd.service" and "journalctl -xe"…
Coconut
  • 184
  • 1
  • 15