
I have been using a cluster to do some heavy computing. There are a few things I do not understand. For instance, I have used this configuration for all my jobs so far:

#SBATCH -J ss        # job name
#SBATCH -N 1         # allocate 1 node for the job
#SBATCH -n 15        # 15 tasks total
#SBATCH -t 12:0:0    # 12-hour time limit
#SBATCH --mem=12000  # 12000 MB of memory

However, I do not know whether a node is a computer (-N 1), or what exactly a task is (-n 15).

My code uses MPI, but ideally I would like to run hybrid MPI + OpenMP. How should I configure my SBATCH directives to do so?

Thank you.

Terrence

2 Answers


A cluster is a group of nodes; each node is an independent computer (a bunch of CPUs, sometimes with GPUs or other accelerators), and the nodes are connected by a network (it is worth noting that memory addresses are global on some supercomputers). Broadly, there are two types of supercomputer: shared memory and distributed memory.

It is worth reading a bit about supercomputer architecture... Wikipedia is a good starting point!

A process is an independent unit of work. Processes do not share memory; they need a way to access each other's memory, and to do so you use a library such as MPI.

In Slurm, a process is called a task.

To set the number of tasks (processes, in fact) you use --ntasks, or simply -n. You can also set the number of tasks per node, or the number of nodes. These are two different things!

--ntasks-per-node gives you the number of tasks per node, while --nodes gives the minimum number of nodes you want. If you specify --nodes=2, it means you will get at least 2 nodes, but it might be more: if your nodes have 18 cores and you ask for 40 tasks, then you need at least 3 nodes. That is why one should avoid using --nodes (except if you know what you are doing!).
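For instance, here is a sketch of that scenario (assuming hypothetical 18-core nodes; the numbers are only illustrative):

#SBATCH --ntasks=40           # 40 tasks in total
#SBATCH --ntasks-per-node=18  # at most 18 tasks on each node
# 40 tasks at 18 per node -> Slurm needs at least 3 nodes (18+18+4)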

Then a given number of CPUs (cores of your processor) can be allocated to a single task. This is set using --cpus-per-task.

One MPI rank is one task, and a task can launch multiple threads. If you set --cpus-per-task to one, all those threads will run on the same core and therefore compete for it. Usually you want one thread per core (or 2 if you use hyperthreading).

When you set --cpus-per-task, it has to be no larger than the number of cores per node, as a task can run only on a single node (on a distributed-memory system)!
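If you are unsure how many cores your nodes have, you can ask Slurm directly (a quick check; the two output columns are node names and CPU counts):

sinfo -o "%N %c"    # list node names and their CPUs per node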

To summarize:

So if you want to run M MPI processes that will launch N threads each: N must be no larger than the number of cores per node, and preferably an integer divisor of the number of cores per node (otherwise you will waste some cores).

You will set: --ntasks="M" --cpus-per-task="N"

Then you will run using: srun ./your_hybrid_app

Then do not forget two things. First, if you use OpenMP, set the number of threads:

export OMP_NUM_THREADS="N"
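As discussed in the comments below, it is safer to derive the value from the allocation instead of hard-coding a number, so it cannot drift out of sync with --cpus-per-task:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK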

Second, do not forget to initialize MPI properly for multithreading: call MPI_Init_thread instead of MPI_Init and request the thread support level your code needs (as noted in the comments, MPI_THREAD_FUNNELED is usually enough, and usually the best choice performance-wise).

#!/bin/bash -l
#
#SBATCH --account=myAccount
#SBATCH --job-name="a job"
#SBATCH --time=24:00:00
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=4
#SBATCH --output=%j.o
#SBATCH --error=%j.e

export OMP_NUM_THREADS=4
srun ./your_hybrid_app

This will launch 16 tasks, with 4 cores per task (and 4 OpenMP threads per task, so one thread per core).
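Assuming the script is saved as job.sh (an illustrative name), you would submit it with:

sbatch job.sh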

David Daverio
  • The first paragraph is a bit wonky. On distributed memory supercomputers (and most of the world's supercomputers do have distributed memory) addresses are not global. – High Performance Mark Feb 23 '19 at 19:40
  • Sorry to tell you, but Cray, IBM, and Bull use global addresses for memory (though that does not mean a node can access all those addresses; on Cray I think it can). – David Daverio Feb 23 '19 at 19:59
  • That's false. It depends on the specific architecture of the computer, not on the vendor. I've been working for 7 years with an IBM supercomputer and for 9 years on a Bull one, and both have no global addresses. – Poshi Feb 23 '19 at 21:16
  • Blue Gene/Q, XE6, XC30/40/50, and XK7 definitely have global memory addresses (I checked the specs...). And any cluster that supports RDMA needs a global addressing system... But who cares! – David Daverio Feb 24 '19 at 02:34
  • What exactly do you mean by "memory addresses are usually global"? IMHO "MPI provides a way for processes to access the memory of remote processes" is a bit misleading (sure, one-sided communication can do that, but MPI is mainly used for message passing). Do you really **have to** set `OMP_NUM_THREADS` with `srun`? If not, you'd rather not set it, to avoid mistakes when `--cpus-per-task` ends up being different from `OMP_NUM_THREADS`. Most hybrid MPI+OpenMP apps do not invoke MPI within OpenMP regions, and hence do not require `MPI_THREAD_MULTIPLE` (which can have a negative impact on performance) – Gilles Gouaillardet Feb 24 '19 at 02:41
  • Usually I set OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK; it's safer than using a number... I think Slurm sets OMP_NUM_THREADS, but I have seen some weird behaviour, especially on "small hand-made" clusters. Indeed, MPI_THREAD_FUNNELED is usually the best choice in terms of performance. I would add that more than 4-8 threads per MPI rank is usually a bad idea performance-wise. – David Daverio Feb 24 '19 at 02:59
  • I do not think SLURM sets `OMP_NUM_THREADS`. If it performs CPU binding, then the OpenMP runtime will likely figure out the MPI task was given `$SLURM_CPUS_PER_TASK` CPUs to run on, and start the same number of OpenMP threads (so you can observe the same behaviour as if `OMP_NUM_THREADS` were set). Most apps will suffer performance degradation if running OpenMP threads across multiple sockets, so if SLURM does task binding, threads should be packed on the same socket (vs scattered over multiple sockets), and the end user should not ask for more threads than cores per socket (not per node). – Gilles Gouaillardet Feb 24 '19 at 03:20
  • Please edit the post regarding the sockets... But then one should not only set SLURM_CPUS_PER_TASK to be smaller than the number of cores per socket... With 2 sockets, 12 cores per socket, and 8 CPUs per task, Slurm will deliver 3 MPI tasks with one spread over the 2 sockets. To avoid this, one also has to set --nodes, to ensure a minimum number of nodes is used (with the risk of wasting some cores...). And all this starts to be a bit advanced with respect to the question lol – David Daverio Feb 24 '19 at 03:32

A node is a computer, and a task is each binary that is loaded into memory (in MPI, the same binary several times). If those binaries also use OpenMP or threading (any kind of multiprocessing within the same node), then you also have to tell Slurm how many CPUs each task will use, as sketched below.
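A minimal sketch of such a request (the numbers are purely illustrative):

#SBATCH -N 1               # one node (one computer)
#SBATCH -n 4               # four tasks (e.g. four MPI ranks)
#SBATCH --cpus-per-task=2  # two CPUs for each task's threads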

Poshi