Questions tagged [slurm]

Slurm (formerly spelled SLURM) is an open-source resource manager designed for Linux HPC clusters of all sizes.

Slurm: A Highly Scalable Resource Manager

Slurm is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
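These three functions map onto the commands most users meet first: sbatch submits work, squeue shows the queue of pending work, and scancel releases an allocation. A minimal batch script might look like the following sketch (the job name and resource requests are illustrative):

```shell
#!/bin/bash
# The #SBATCH comment lines are resource requests read by sbatch;
# the rest of the script is the work run on the allocated node.
#SBATCH --job-name=hello      # name shown in squeue output
#SBATCH --nodes=1             # allocate a single node
#SBATCH --ntasks=1            # run one task
#SBATCH --time=00:05:00       # wall-clock time limit

echo "Running on $(hostname)"
```

Submitted with sbatch hello.sh, the job waits in the queue until a node is free, then the script runs on that node.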

Slurm's design is very modular with dozens of optional plugins. In its simplest configuration, it can be installed and configured in a couple of minutes (see Caos NSA and Perceus: All-in-one Cluster Software Stack by Jeffrey B. Layton) and was used by Intel on their 48-core "cluster on a chip". More complex configurations can satisfy the job scheduling needs of world-class computer centers and rely upon a MySQL database for archiving accounting records, managing resource limits by user or bank account, or supporting sophisticated job prioritization algorithms.

While other resource managers do exist, Slurm is unique in several respects:

  • It is designed to operate in heterogeneous clusters with over 100,000 nodes and millions of processors.
  • It can sustain a throughput rate of hundreds of thousands of jobs per hour, with bursts of job submissions at several times that rate.
  • Its source code is freely available under the GNU General Public License.
  • It is portable: written in C and using the GNU autoconf configuration engine. While initially written for Linux, other UNIX-like operating systems should be easy porting targets.
  • It is highly tolerant of system failures, including failure of the node executing its control functions.
  • A plugin mechanism exists to support various interconnects, authentication mechanisms, schedulers, etc. These plugins are documented and simple enough for the motivated end user to understand the source and add functionality.
  • Configurable node power control functions allow putting idle nodes into a power-save/power-down mode. This is especially useful for "elastic burst" clusters which expand dynamically to a cloud virtual machine (VM) provider to accommodate workload bursts.
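As a sketch of the power-saving behaviour mentioned in the last point, slurm.conf accepts parameters along these lines (the thresholds and script paths are illustrative; the suspend/resume scripts are site-provided):

```
# Excerpt from slurm.conf: power down nodes idle for 10 minutes
# and wake them again on demand.
SuspendTime=600                               # seconds idle before suspending a node
SuspendProgram=/usr/local/sbin/node_suspend.sh
ResumeProgram=/usr/local/sbin/node_resume.sh
ResumeTimeout=300                             # seconds to wait for a node to come back
```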


Name Spelling

As of v18.08, the name spelling “SLURM” has been changed to “Slurm” (commit 3d7ada78e).

Other Uses of the Name

Slurm is also a fictional soft drink in the Futurama multiverse, where it is popular and highly addictive.

1738 questions
0 votes, 2 answers

Using sbatch to write to file

I'm new to Slurm, and I'm trying to batch a shell script to write to a text file. My shell script (entitled "troublesome.sh") looks like this: #!/bin/bash #SBATCH -N 1 #SBATCH -n 1 echo "It worked!" When I run sh troublesome.sh > doeswork.txt it…
asked by Kate
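For questions like the one above, the usual catch is that #SBATCH lines are only honored when the script is submitted with sbatch; running it with sh treats them as ordinary comments. A sketch of the common pattern, using the file names from the question:

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --output=doeswork.txt  # sbatch redirects the job's stdout/stderr here

echo "It worked!"
```

Submitted as sbatch troublesome.sh, the echoed text lands in doeswork.txt; sh troublesome.sh bypasses Slurm entirely.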
0 votes, 1 answer

What is the difference between a normal user and a user created by sacctmgr for some account?

There are some users (listed in /etc/passwd) who can use Slurm to submit jobs in our cluster. But with sacctmgr we can also define users belonging to some account(s). What should be the connection between these two groups of users? Thanks.
asked by potant
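For context on the question above: the two sets of users serve different layers. The system account (in /etc/passwd or LDAP) authenticates the person, while the sacctmgr entry associates that same login name with a bank account for accounting and resource limits; depending on the AccountingStorageEnforce setting, a job may be rejected unless both exist. A hedged sketch with hypothetical account and user names, run as a Slurm administrator:

```
# 'alice' must also exist as a system user with the same name.
sacctmgr add account research Description="research group"
sacctmgr add user alice Account=research
sacctmgr show associations where user=alice
```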
0 votes, 0 answers

How do you import python modules with Slurm?

This is my shell script for sbatch: #!/bin/bash #SBATCH ..... #SBATCH ..... #SBATCH ..... module load python/3.7.1 srun python runcomb.py I'm running a Python script on an HPC which uses Slurm. For some reason, even basic Python modules aren't being…
0 votes, 1 answer

Running an MPI job on multiple nodes with the Slurm scheduler

I'm trying to run an MPI application with a specific task/node configuration. I need to run a total of 8 MPI tasks, 4 on one node and 4 on another. This is the script file I'm using: #!/bin/bash #SBATCH --time=00:30:00 #SBATCH…
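The layout asked for above (8 ranks, 4 per node) is usually expressed directly in the resource directives; a sketch, with the application binary name a placeholder:

```shell
#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --nodes=2              # two nodes...
#SBATCH --ntasks-per-node=4    # ...with 4 MPI ranks each, 8 tasks total

# srun launches one copy of the program per allocated task.
srun ./my_mpi_app              # placeholder binary name
```

srun inherits the geometry from the directives, so no extra -n/-N flags are needed at launch.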
0 votes, 1 answer

How can I execute a SLURM script from within multiple directories simultaneously?

I want to execute a SLURM script from within multiple directories simultaneously. More specifically, I have ten array folders numbered array_1 through array_10 from which I want to execute the script. Within each of these directories, the script…
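A common pattern for the question above is a small submission loop; sbatch's --chdir option makes each job run with the given directory as its working directory (the job script name here is a placeholder):

```shell
#!/bin/bash
# Submit the same script once per array_* directory; each job starts
# in its own directory, so relative paths resolve independently.
for i in $(seq 1 10); do
    sbatch --chdir="array_${i}" myscript.sh
done
```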
0 votes, 0 answers

MPI Bcast error while using Calcul Canada

I am trying to run a calculation on the Calcul Canada remote site from my Mac. This is the input file run.sh I am using: #!/bin/sh #SBATCH --nodes=4 #SBATCH --ntasks-per-node=32 #SBATCH --time=24:00:00 #SBATCH --mem-per-cpu=2000M #SBATCH…
asked by coralie
0 votes, 1 answer

How to distribute custom code through SLURM manager?

I have access to a computer cluster with the Slurm manager. I want different nodes to execute different parts of my code. If I understood properly, this can be achieved through Slurm with the srun command if your code is properly…
asked by Nevena
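One hedged way to have different tasks run different parts of a program, as the question above asks, is srun's --multi-prog mode, where a configuration file maps task ranks to commands (the file and binary names here are illustrative):

```shell
# parts.conf maps task ranks to executables, one rule per line:
#   0-3  ./compute_part_a
#   4-7  ./compute_part_b
srun --ntasks=8 --multi-prog parts.conf
```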
0 votes, 1 answer

How to run a python code with multiple inputs on the same node with slurm id?

I want to run a python program 10 times and save different output files as output_1, output_2, output_3… and so on. It can be run using 1 processor and 10 threads. I have access to 96 CPUs on a node, so I want to perform all these 10 jobs in…
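The ten-runs-with-distinct-outputs pattern above is what job arrays are for: each array task gets its own SLURM_ARRAY_TASK_ID, which can name the output file. A sketch, with an echo standing in for the real Python program and the ID defaulted so the script also runs outside Slurm:

```shell
#!/bin/bash
#SBATCH --array=1-10          # ten independent array tasks, IDs 1..10
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10    # 10 threads per task

# Each task writes its own output_<N> file.
id="${SLURM_ARRAY_TASK_ID:-1}"
echo "run ${id}" > "output_${id}"   # replace with: python program.py > "output_${id}"
```

Note that array tasks are scheduled independently and may land on different nodes; pinning all ten runs to one node usually means a single job that launches ten background steps instead.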
0 votes, 1 answer

Compiler does not utilize all CPUs, I need your advice

My PC has two Xeon E5-2678 v3 CPUs, 12 cores/24 threads each (24 cores/48 threads total). I submitted a Slurm batch job that requests multiple cores for my code (a CFD Fortran code built with the Intel Fortran compiler on Linux). The code runs well but it seems that all…
0 votes, 0 answers

Does the file get changed in squeue if I modify it after it has been sent to the queue?

I have a question: I have a neural net file model.py with some parameters set. I have sent it to the Slurm queue. When doing squeue I can see that it is still waiting because there are other jobs running. Now, I want to send another variation of…
asked by alienflow
0 votes, 1 answer

Dereference error when accessing Slurm job resources using C API

I am trying to get memory usage information for each job in the Slurm cluster using C API: #include #include #include #include "slurm/slurm.h" #include "slurm/slurm_errno.h" int main(int argc, char** argv) { …
asked by mac13k
0 votes, 1 answer

Job array step single execution

I have a sbatch script to submit job arrays to Slurm with different steps: #!/bin/bash #SBATCH --ntasks 1 #SBATCH --nodes 1 #SBATCH --time 00-01:00:00 #SBATCH…
asked by Bub Espinja
0 votes, 1 answer

Has anyone successfully used shopt -s extglob (extended globbing) in bash with SBATCH settings on an HPC?

To summarise: I am using the bash shell, version 4.2.46(2)-release. I want to submit a batch job script to the Slurm job scheduler where, in the script, I use extended globbing, which is turned on using shopt -s extglob on a separate line to the extended…
0 votes, 1 answer

Interpretation of output from sacct: meaning of ex+

I would like to know if a job is using one or two CPUs, based on the interpretation of the following sacct output. I have searched documentation about the meaning of the ex+ row but found nothing: how should I interpret that row? JobID State …
asked by andrea m.
0 votes, 1 answer

How to add extensions when running NetLogo headlessly on a cluster?

I am using a common NetLogo extension, "CSV", to read a table. The job fails because it cannot find the extension (although I am sure the extension file is present). How do I specify that I want to use an extension when working with NetLogo…
asked by Andrea