Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
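
For illustration, a minimal batch script (resource values here are hypothetical) exercises the first two functions: the #SBATCH directives request an allocation, and srun launches the work on the allocated nodes.

    #!/bin/bash
    #SBATCH --job-name=demo         # name shown in the queue
    #SBATCH --nodes=2               # request two compute nodes
    #SBATCH --ntasks-per-node=4     # run four tasks on each node
    #SBATCH --time=00:10:00         # ten-minute time limit
    srun ./my_parallel_program      # launch the job step on the allocation

Submitted with sbatch demo.sh, the job sits in the pending queue until resources free up, which is the third function (arbitration) in action.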

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.
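
To give a sense of scale for that "simplest configuration": a near-minimal slurm.conf can be just a handful of lines. This is a sketch only; the cluster name, hostnames, node counts, and CPU counts are placeholders.

    # Near-minimal slurm.conf sketch (placeholder names throughout)
    ClusterName=mycluster
    SlurmctldHost=headnode                      # node running slurmctld
    NodeName=node[01-04] CPUs=8 State=UNKNOWN   # four 8-CPU compute nodes
    PartitionName=debug Nodes=node[01-04] Default=YES MaxTime=INFINITE State=UP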

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in heterogeneous clusters containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C, it uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts (a configuration sketch follows this list).
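
The power-saving behavior in the last item is driven by a few slurm.conf parameters. A sketch, with the script paths as placeholders; the suspend/resume scripts themselves must be supplied by the site:

    SuspendTime=300                              # seconds idle before suspending a node
    SuspendProgram=/usr/local/sbin/node_suspend  # site script that powers nodes down
    ResumeProgram=/usr/local/sbin/node_resume    # site script that powers them back up
    ResumeTimeout=600                            # seconds allowed for a node to boot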

Name Spelling

As of v18.08, the spelling of the name was changed from “SLURM” to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
0 votes • 2 answers

Snakemake gives InputFunctionException when using --profile slurm

I'm creating a pipeline using snakemake to call methylation in nanopore sequencing data. I've run snakemake using the --dryrun option and the DAG is constructed successfully. But when I add the option --profile slurm, I get the following…
dsperley • 33 • 4
0 votes • 1 answer

srun used in a loop: srun: Job step aborted: Waiting up to 32 seconds for job step to finish

I run a .sh file via srun because I want to see the dynamic print-out of the script. But when running srun job_spinup.sh southfr_exp 1 & I always get an error (a time-out due to the time limit) after 2 main loops... here is the main code in the…
Xu Shan • 175 • 3 • 11
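
A common shape for this kind of workflow (a sketch of the technique, not the asker's exact setup; loop values are placeholders): request an allocation large enough for all iterations, launch each srun as a backgrounded job step inside it, and wait so the allocation is not torn down while steps are still running.

    #!/bin/bash
    #SBATCH --ntasks=2          # hypothetical: room for two concurrent steps
    #SBATCH --time=02:00:00     # must cover all loop iterations combined
    for arg in southfr_exp other_exp; do            # placeholder arguments
        srun --ntasks=1 --exclusive ./job_spinup.sh "$arg" 1 &
    done
    wait                        # keep the allocation alive until all steps finish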
0 votes • 2 answers

How to resolve hostnames between pods in a Kubernetes cluster?

I am creating two pods with a custom Docker image (Ubuntu is the base image). I am trying to ping the pods from their terminals. I can reach them using the IP address but not the hostname. How can I achieve this without manually editing /etc/hosts in the…
Akhil • 11 • 1 • 1
0 votes • 1 answer

salloc: error: Job submit/allocate failed: Invalid feature specification

I'm encountering a Slurm error. I logged into the Slurm controller to verify that Slurm is working properly: $ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST debug* up infinite 8 idle ip-192-168-73-[129,179],..... Checked if nodes…
Chaitanya Bapat • 3,381 • 6 • 34 • 59
0 votes • 1 answer

Schedule a python execution every 24h

I'm training several neural networks on a server at my university. Due to limited resources for all the students, there is a job scheduling system called Slurm that queues all students' runs, and in addition we are only allowed to run our commands…
mgrau • 51 • 5
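
Since cron is often unavailable to cluster users, one Slurm-native approach to a recurring run is a job that re-queues itself for the next day before doing its work. A sketch only; the script and file names are hypothetical:

    #!/bin/bash
    #SBATCH --job-name=daily-train
    #SBATCH --time=04:00:00                    # placeholder per-run limit
    sbatch --begin=now+24hours daily_train.sh  # queue tomorrow's run first
    python train.py                            # hypothetical training command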
0 votes • 1 answer

In Slurm, is it possible to assign a different number of CPUs for every task?

I am running MPI-over-OpenMP jobs on a Slurm cluster and I am trying to figure out a way to give a different number of CPUs to each generated task. For example, let's say we run this job: srun --nodes 1 --ntasks 2 --cpus-per-task 2 ./mpi_exe This…
K. Iliakis • 13 • 5
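
Slurm's heterogeneous job support (the ":" component separator, available in srun since Slurm 17.11) is one way to give components different CPU counts. A sketch with placeholder sizes; how the components share an MPI world depends on the Slurm version and MPI stack:

    # First component: 1 task with 4 CPUs; second component: 1 task with 2 CPUs.
    srun --nodes=1 --ntasks=1 --cpus-per-task=4 ./mpi_exe : \
         --ntasks=1 --cpus-per-task=2 ./mpi_exe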
0 votes • 1 answer

slurm, job allocating more CPUs than requested

I recently configured a Slurm queueing system for a server with one node and 72 CPUs. Here is the conf file: # slurm.conf file generated by configurator easy.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more…
Laetis • 1,337 • 3 • 16 • 28
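
Behavior like this frequently traces back to the node selection plugin: with the default select/linear, Slurm hands out whole nodes regardless of what a job asks for. A sketch of the slurm.conf lines that make CPUs individually consumable (the daemons must be restarted after the change):

    SelectType=select/cons_res       # allocate individual CPUs, not whole nodes
    SelectTypeParameters=CR_Core     # treat cores as the consumable unit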
0 votes • 1 answer

equivalent of ThreadPool for Processes in Java

I am starting a number of external processes (using the class Process) from a class in Java that is itself called by a job scheduler on a Linux platform. My problem is that the job scheduler kills everything as soon as the last process has been…
Myoch • 793 • 1 • 6 • 24
0 votes • 1 answer

What will happen if the resources I requested are not enough while the job is running?

In Slurm, what will happen if the resources I requested are not enough while the job is running? For example, #SBATCH --memory=10G; #SBATCH --cpus-per-task=2; python mytrain.py is in myscript.sh. After I run sbatch myscript.sh the job is…
Jingnan Jia • 1,108 • 2 • 12 • 28
0 votes • 2 answers

How to run two multiprocessing programs in one batch using SLURM?

I have a SLURM cluster with several nodes and 16 vCPUs per node. I've tried to run the following code: #SBATCH --nodes 2 #SBATCH --ntasks 2 #SBATCH -c 16 srun --exclusive --nodes=1 program1 & srun --exclusive --nodes=1 program2 & wait program1 and…
vachram • 1 • 3
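
For reference, a sketch of the usual shape of such a script, assuming 16 CPUs per program: each srun gets its own task and CPU count spelled out explicitly, and the script waits for both steps, since exiting early would kill them.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=16
    srun --exclusive --nodes=1 --ntasks=1 --cpus-per-task=16 ./program1 &  # first node
    srun --exclusive --nodes=1 --ntasks=1 --cpus-per-task=16 ./program2 &  # second node
    wait   # without this the batch script exits and both steps are killed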
0 votes • 1 answer

slurm: unable to get job's information using scontrol

When I run the following command I am able to see a bunch of Slurm jobs. Since I can see them, I believe their logs should be saved. $ sacct --format="JobID,JobName%30" JobID JobName ------------…
alper • 2,919 • 9 • 53 • 102
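
Worth knowing here: scontrol show job only covers jobs still held in the controller's memory (they are purged shortly after completion, per MinJobAge), while sacct reads the accounting database. A sketch with a placeholder job ID:

    scontrol show job 12345                                  # recent/active jobs only
    sacct -j 12345 --format=JobID,JobName%30,State,Elapsed   # completed jobs too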
0 votes • 1 answer

Shell script: Wait for any process in a group to finish

I have a (small) list of n scripts that I need to submit to slurm on linux. Each script does some work and then writes output to a file. The work portion of each script executes much faster when I request 32 cores than when I request 16 or (worse) 8…
Attila the Fun • 327 • 2 • 13
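
In bash 4.3 and later, wait -n returns as soon as any one background child exits, which matches this requirement. A minimal sketch with placeholder script names:

    #!/bin/bash
    ./task_a.sh & ./task_b.sh & ./task_c.sh &    # hypothetical worker scripts
    wait -n                                      # returns when the first one exits
    echo "first job finished; others may still be running"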
0 votes • 1 answer

Indices of batch array not in the right order

I have 4 files: text1005.txt text1007.txt text1009.txt text1012.txt I create a list: FILES=$(find -type f -name "*txt") arr=${FILES} But when I want to print by index, it doesn't give the right file. For example, echo…
Paillou • 779 • 7 • 16
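
The usual culprit behind this symptom is that FILES=$(find ...) produces a single string rather than an array, so indexing does not behave as expected. A sketch of the array form, sorted so the indices are deterministic (it splits on whitespace, so it assumes no spaces in file names):

    arr=( $(find . -type f -name "*.txt" | sort) )   # a real bash array
    echo "${arr[0]}"                                 # first file
    echo "${#arr[@]}"                                # number of files
    file="${arr[$SLURM_ARRAY_TASK_ID]}"              # pick this array task's file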
0 votes • 2 answers

How to return 0 if awk returns null from processing an expression?

I currently have an awk method to check whether an expression's output contains more than one line. If it does, it aggregates and prints the sum. For example: someexpression=$'JOBID PARTITION NAME USER ST TIME NODES…
user321627 • 2,350 • 4 • 20 • 43
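
A standard awk idiom for this case: adding 0 to the accumulator in the END block coerces an unset (null) variable to numeric 0, so empty input prints 0 instead of a blank line. A generic sketch with a placeholder file and column:

    awk '{ sum += $1 } END { print sum + 0 }' data.txt   # prints 0 for empty input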
0 votes • 0 answers

Slurm: running multiple OpenMP jobs, each on a single node, with a script

I am looking for a simple way to run a program with various input files using Slurm. I want to run each instance of the program on a single node so that it can make use of OpenMP. I found that probably the best way would be to use job arrays. But I…
atapaka • 1,172 • 4 • 14 • 30
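
A sketch of the job-array pattern the question is heading toward, with the counts and file names as placeholders: each array task lands on one node and runs the OpenMP program on its own input file.

    #!/bin/bash
    #SBATCH --array=0-9               # ten inputs -> ten independent array tasks
    #SBATCH --nodes=1                 # one node per task
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16        # placeholder core count per node
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_openmp_program "input_${SLURM_ARRAY_TASK_ID}.dat"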