Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
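
For illustration, a minimal batch script (resource values here are hypothetical) exercises the first two functions: the #SBATCH directives request an allocation, and srun launches the work on the allocated nodes.

    #!/bin/bash
    #SBATCH --job-name=demo         # name shown in the queue
    #SBATCH --nodes=2               # request two compute nodes
    #SBATCH --ntasks-per-node=4     # run four tasks on each node
    #SBATCH --time=00:10:00         # ten-minute time limit
    srun ./my_parallel_program      # launch the job step on the allocation

Submitted with sbatch demo.sh, the job sits in the pending queue until resources free up, which is the third function (arbitration) in action.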

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.
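
To give a sense of scale for that "simplest configuration": a near-minimal slurm.conf can be just a handful of lines. This is a sketch only; the cluster name, hostnames, node counts, and CPU counts are placeholders.

    # Near-minimal slurm.conf sketch (placeholder names throughout)
    ClusterName=mycluster
    SlurmctldHost=headnode                      # node running slurmctld
    NodeName=node[01-04] CPUs=8 State=UNKNOWN   # four 8-CPU compute nodes
    PartitionName=debug Nodes=node[01-04] Default=YES MaxTime=INFINITE State=UP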

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in heterogeneous clusters containing over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C, it uses the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts (a configuration sketch follows this list).
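
The power-saving behavior in the last item is driven by a few slurm.conf parameters. A sketch, with the script paths as placeholders; the suspend/resume scripts themselves must be supplied by the site:

    SuspendTime=300                              # seconds idle before suspending a node
    SuspendProgram=/usr/local/sbin/node_suspend  # site script that powers nodes down
    ResumeProgram=/usr/local/sbin/node_resume    # site script that powers them back up
    ResumeTimeout=600                            # seconds allowed for a node to boot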

Name Spelling

As of v18.08, the spelling of the name was changed from “SLURM” to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
0 votes • 2 answers

Snakemake gives InputFunctionException when using --profile slurm

I'm creating a pipeline using snakemake to call methylation in nanopore sequencing data. I've run snakemake using the --dryrun option and the DAG is constructed successfully. But when I add the option --profile slurm, I get the following…
dsperley • 33 • 4
0 votes • 1 answer

srun used in a loop: srun: Job step aborted: Waiting up to 32 seconds for job step to finish

I run a .sh file via srun because I want to see the dynamic print-out of the script. But when running srun job_spinup.sh southfr_exp 1 & I always get an error (a time-out due to the time limit) after 2 main loops... here is the main code in the…
Xu Shan • 175 • 3 • 11
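
A common shape for this kind of workflow (a sketch of the technique, not the asker's exact setup; loop values are placeholders): request an allocation large enough for all iterations, launch each srun as a backgrounded job step inside it, and wait so the allocation is not torn down while steps are still running.

    #!/bin/bash
    #SBATCH --ntasks=2          # hypothetical: room for two concurrent steps
    #SBATCH --time=02:00:00     # must cover all loop iterations combined
    for arg in southfr_exp other_exp; do            # placeholder arguments
        srun --ntasks=1 --exclusive ./job_spinup.sh "$arg" 1 &
    done
    wait                        # keep the allocation alive until all steps finish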
0 votes • 2 answers

How to resolve hostnames between pods in a Kubernetes cluster?

I am creating two pods with a custom Docker image (Ubuntu is the base image). I am trying to ping the pods from their terminals. I can reach them using the IP address but not the hostname. How can I achieve this without manually editing /etc/hosts in the…
Akhil • 11 • 1 • 1
0 votes • 1 answer

salloc: error: Job submit/allocate failed: Invalid feature specification

I'm encountering a Slurm error. I logged into the Slurm controller to verify that Slurm is working properly: $ sinfo PARTITION AVAIL TIMELIMIT NODES STATE NODELIST debug* up infinite 8 idle ip-192-168-73-[129,179],..... Checked if nodes…
Chaitanya Bapat • 3,381 • 6 • 34 • 59
0 votes • 1 answer

Schedule a python execution every 24h

I'm training several neural networks on a server at my university. Due to limited resources for all the students, there is a job scheduling system called Slurm that queues all students' runs, and in addition we are only allowed to run our commands…
mgrau • 51 • 5
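
Since cron is often unavailable to cluster users, one Slurm-native approach to a recurring run is a job that re-queues itself for the next day before doing its work. A sketch only; the script and file names are hypothetical:

    #!/bin/bash
    #SBATCH --job-name=daily-train
    #SBATCH --time=04:00:00                    # placeholder per-run limit
    sbatch --begin=now+24hours daily_train.sh  # queue tomorrow's run first
    python train.py                            # hypothetical training command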
0 votes • 1 answer

In Slurm, is it possible to assign a different number of CPUs for every task?

I am running MPI-over-OpenMP jobs on a Slurm cluster and I am trying to figure out a way to give a different number of CPUs to each generated task. For example, let's say we run this job: srun --nodes 1 --ntasks 2 --cpus-per-task 2 ./mpi_exe This…
K. Iliakis • 13 • 5
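
Slurm's heterogeneous job support (the ":" component separator, available in srun since Slurm 17.11) is one way to give components different CPU counts. A sketch with placeholder sizes; how the components share an MPI world depends on the Slurm version and MPI stack:

    # First component: 1 task with 4 CPUs; second component: 1 task with 2 CPUs.
    srun --nodes=1 --ntasks=1 --cpus-per-task=4 ./mpi_exe : \
         --ntasks=1 --cpus-per-task=2 ./mpi_exe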
0 votes • 1 answer

slurm, job allocating more CPUs than requested

I recently configured a Slurm queueing system for a server with one node and 72 CPUs. Here is the conf file: # slurm.conf file generated by configurator easy.html. # Put this file on all nodes of your cluster. # See the slurm.conf man page for more…
Laetis • 1,337 • 3 • 16 • 28
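
Behavior like this frequently traces back to the node selection plugin: with the default select/linear, Slurm hands out whole nodes regardless of what a job asks for. A sketch of the slurm.conf lines that make CPUs individually consumable (the daemons must be restarted after the change):

    SelectType=select/cons_res       # allocate individual CPUs, not whole nodes
    SelectTypeParameters=CR_Core     # treat cores as the consumable unit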
0 votes • 1 answer

equivalent of ThreadPool for Processes in Java

I am starting a number of external processes (using the class Process) from a class in Java that is itself called by a job scheduler on a Linux platform. My problem is that the job scheduler kills everything as soon as the last process has been…
Myoch • 793 • 1 • 6 • 24
0 votes • 1 answer

What will happen if the resources I requested are not enough while the job is running?

In Slurm, what will happen if the resources I requested are not enough while the job is running? For example, #SBATCH --memory=10G; #SBATCH --cpus-per-task=2; python mytrain.py is in myscript.sh. After I run sbatch myscript.sh the job is…
Jingnan Jia • 1,108 • 2 • 12 • 28
0 votes • 2 answers

How to run two multiprocessing programs in one batch using SLURM?

I have a SLURM cluster with several nodes and 16 vCPUs per node. I've tried to run the following code: #SBATCH --nodes 2 #SBATCH --ntasks 2 #SBATCH -c 16 srun --exclusive --nodes=1 program1 & srun --exclusive --nodes=1 program2 & wait program1 and…
vachram • 1 • 3
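
For reference, a sketch of the usual shape of such a script, assuming 16 CPUs per program: each srun gets its own task and CPU count spelled out explicitly, and the script waits for both steps, since exiting early would kill them.

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=2
    #SBATCH --cpus-per-task=16
    srun --exclusive --nodes=1 --ntasks=1 --cpus-per-task=16 ./program1 &  # first node
    srun --exclusive --nodes=1 --ntasks=1 --cpus-per-task=16 ./program2 &  # second node
    wait   # without this the batch script exits and both steps are killed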
0 votes • 1 answer

slurm: unable to get job's information using scontrol

When I run the following command I am able to see a bunch of Slurm jobs. Since I can see them, I believe their logs should be saved. $ sacct --format="JobID,JobName%30" JobID JobName ------------…
alper • 2,919 • 9 • 53 • 102
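
Worth knowing here: scontrol show job only covers jobs still held in the controller's memory (they are purged shortly after completion, per MinJobAge), while sacct reads the accounting database. A sketch with a placeholder job ID:

    scontrol show job 12345                                  # recent/active jobs only
    sacct -j 12345 --format=JobID,JobName%30,State,Elapsed   # completed jobs too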
0 votes • 1 answer

Shell script: Wait for any process in a group to finish

I have a (small) list of n scripts that I need to submit to slurm on linux. Each script does some work and then writes output to a file. The work portion of each script executes much faster when I request 32 cores than when I request 16 or (worse) 8…
Attila the Fun • 327 • 2 • 13
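
In bash 4.3 and later, wait -n returns as soon as any one background child exits, which matches this requirement. A minimal sketch with placeholder script names:

    #!/bin/bash
    ./task_a.sh & ./task_b.sh & ./task_c.sh &    # hypothetical worker scripts
    wait -n                                      # returns when the first one exits
    echo "first job finished; others may still be running"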
0 votes • 1 answer

Indices of batch array not in the right order

I have 4 files: text1005.txt text1007.txt text1009.txt text1012.txt I create a list: FILES=$(find -type f -name "*txt") arr=${FILES} But when I want to print by index, it doesn't give the right file. For example, echo…
Paillou • 779 • 7 • 16
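
The usual culprit behind this symptom is that FILES=$(find ...) produces a single string rather than an array, so indexing does not behave as expected. A sketch of the array form, sorted so the indices are deterministic (it splits on whitespace, so it assumes no spaces in file names):

    arr=( $(find . -type f -name "*.txt" | sort) )   # a real bash array
    echo "${arr[0]}"                                 # first file
    echo "${#arr[@]}"                                # number of files
    file="${arr[$SLURM_ARRAY_TASK_ID]}"              # pick this array task's file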
0 votes • 2 answers

How to return 0 if awk returns null from processing an expression?

I currently have an awk method to check whether an expression's output contains more than one line. If it does, it aggregates and prints the sum. For example: someexpression=$'JOBID PARTITION NAME USER ST TIME NODES…
user321627 • 2,350 • 4 • 20 • 43
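
A standard awk idiom for this case: adding 0 to the accumulator in the END block coerces an unset (null) variable to numeric 0, so empty input prints 0 instead of a blank line. A generic sketch with a placeholder file and column:

    awk '{ sum += $1 } END { print sum + 0 }' data.txt   # prints 0 for empty input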
0 votes • 0 answers

Slurm: running multiple OpenMP jobs, each on a single node, with a script

I am looking for a simple way to run a program with various input files using Slurm. I want to run each instance of the program on a single node so that it can make use of OpenMP. I found that probably the best way would be to use job arrays. But I…
atapaka • 1,172 • 4 • 14 • 30
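
A sketch of the job-array pattern the question is heading toward, with the counts and file names as placeholders: each array task lands on one node and runs the OpenMP program on its own input file.

    #!/bin/bash
    #SBATCH --array=0-9               # ten inputs -> ten independent array tasks
    #SBATCH --nodes=1                 # one node per task
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=16        # placeholder core count per node
    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
    ./my_openmp_program "input_${SLURM_ARRAY_TASK_ID}.dat"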