
My program uses MPI+pthreads: n-1 MPI processes run pure MPI code, while a single MPI process uses pthreads. That process contains only 2 threads (the main thread and one pthread). Suppose that the HPC cluster I want to run this program on consists of compute nodes, each of which has 12 cores. How should I write my batch script to maximise utilisation of the hardware?
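For reference, here is a minimal sketch of that structure (not my actual program, which does real work in each branch):

#include <mpi.h>
#include <pthread.h>

/* Work done by the extra POSIX thread of the hybrid rank. */
static void *worker(void *arg)
{
    (void)arg;
    /* ... computation running alongside the main thread ... */
    return NULL;
}

int main(int argc, char *argv[])
{
    int provided, rank, size;

    /* MPI_THREAD_FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == size - 1) {
        /* The hybrid rank: main thread + one pthread = 2 threads. */
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
        /* ... main thread also computes and communicates ... */
        pthread_join(tid, NULL);
    } else {
        /* The other n-1 ranks: pure single-threaded MPI code. */
    }

    MPI_Finalize();
    return 0;
}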

Below is the batch script I wrote. I use export OMP_NUM_THREADS=2 because the last MPI process has 2 threads, and I have to assume that the other processes have 2 threads each as well.

I then allocate 6 MPI processes per node, so each node can run 6 x OMP_NUM_THREADS = 12 threads (= the number of cores on each node), despite the fact that all MPI processes but one have only 1 thread.

#BSUB -J LOOP.N200.L1000_SIMPLE_THREAD
#BSUB -o LOOP.%J
#BSUB -W 00:10
#BSUB -M 1024
#BSUB -N
#BSUB -a openmpi
#BSUB -n 20
#BSUB -m xxx
#BSUB -R "span[ptile=6]"
#BSUB -x

export OMP_NUM_THREADS=2

How can I write a better script for this?

cpp_noname

2 Answers


The following should work if you'd like the last rank to be the hybrid one:

#BSUB -n 20
#BSUB -R "span[ptile=12]"
#BSUB -x

$MPIEXEC $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program : \
         $FLAGS_MPI_BATCH -n 1  -x OMP_NUM_THREADS=2 ./program

If you'd like rank 0 to be the hybrid one, simply switch the two lines:

$MPIEXEC $FLAGS_MPI_BATCH -n 1  -x OMP_NUM_THREADS=2 ./program : \
         $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program

This utilises the ability of Open MPI to launch MIMD programs.
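Putting it together with the #BSUB directives from your original script (only the ptile value changes from 6 to 12), the whole job script would look something like this:

#BSUB -J LOOP.N200.L1000_SIMPLE_THREAD
#BSUB -o LOOP.%J
#BSUB -W 00:10
#BSUB -M 1024
#BSUB -N
#BSUB -a openmpi
#BSUB -n 20
#BSUB -m xxx
#BSUB -R "span[ptile=12]"
#BSUB -x

# 19 single-threaded ranks followed by one hybrid rank with 2 threads
$MPIEXEC $FLAGS_MPI_BATCH -n 19 -x OMP_NUM_THREADS=1 ./program : \
         $FLAGS_MPI_BATCH -n 1  -x OMP_NUM_THREADS=2 ./program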

You mention that your hybrid rank uses POSIX threads and yet you are setting an OpenMP-related environment variable. If you are not really using OpenMP, you don't have to set OMP_NUM_THREADS at all and this simple mpiexec command should suffice:

$MPIEXEC $FLAGS_MPI_BATCH ./program

(In case my guess about the educational institution where you study or work turns out to be wrong, remove $FLAGS_MPI_BATCH and replace $MPIEXEC with mpiexec.)
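For example, assuming an Open MPI build with LSF support (so that mpiexec picks up the list of allocated hosts from LSF automatically), the MIMD launch above would become:

mpiexec -n 19 -x OMP_NUM_THREADS=1 ./program : \
        -n 1  -x OMP_NUM_THREADS=2 ./program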

Hristo Iliev
  • Thanks a lot! So does this mean that the two threads of the last rank will run on any two of the available cores on the second compute node? Without extra information about the two threads, is it possible that LSF may schedule these two threads on the same CPU core? – cpp_noname Oct 31 '14 at 10:53
  • LSF usually schedules to slots, not to cores. Binding (that is restricting process scheduling to certain logical CPUs) is usually done by the MPI implementation or by the OpenMP runtime (or by both for hybrid programs). – Hristo Iliev Oct 31 '14 at 13:24

It's been a while since I've used LSF, so this might not be totally correct; you should experiment with it.

I read your request

#BSUB -n 20
#BSUB -R "span[ptile=6]"

as a total of 20 tasks, with 6 tasks per node, meaning you will get 4 nodes. That seems a waste, since you stated that each node has 12 cores.

How about using all the cores on the nodes, since you have requested exclusive hosts (-x):

#BSUB -x
#BSUB -n 20
#BSUB -R "span[ptile=12]"

export OMP_NUM_THREADS=2

This way you know that ranks

  • 0..11 are on the first host
  • 12..19 are on the second host

whereby the second host has spare slots to accommodate the extra thread of rank 19.
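If you want to verify the placement, a throwaway MPI program (a sketch, separate from your actual code) that reports each rank's host makes the mapping visible:

#include <mpi.h>
#include <stdio.h>

/* Placement check: each rank reports the host it landed on. */
int main(int argc, char *argv[])
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);
    printf("rank %d runs on %s\n", rank, host);
    MPI_Finalize();
    return 0;
}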

Of course, if you need even fancier placements, LSF allows you to shape the job placement with LSB_PJL_TASK_GEOMETRY.

Let's say you had 25 MPI tasks, with rank number 5 using 12 cores:

#BSUB -x
#BSUB -n 25
#BSUB -R "span[ptile=12]"

export LSB_PJL_TASK_GEOMETRY="{(0,1,2,3,4,6,7,8,9,10,11,12)\
                               (13,14,15,16,17,18,19,20,21,22,23,24)\
                               (5)}"

This way, task 5 gets its own node, while the other two groups of twelve tasks each fill the first two nodes.

Timothy Brown
  • Thank you very much. But I still don't quite understand the use of #BSUB -n. Is it used to reserve the given number of hardware compute slots or to indicate the number of parallel processes? And does #BSUB -n distinguish between threads and MPI processes? – cpp_noname Oct 30 '14 at 22:35
  • @cpp_noname, `-n` specifies the number of _slots_. In your case each compute node provides 12 slots and that corresponds to the number of CPU cores on the node (but note that slots do not map directly to cores). Each job could use its slots as it sees fit. Roughly speaking, MPI jobs launch one process in each slot and OpenMP jobs launch one thread in each slot. – Hristo Iliev Oct 31 '14 at 08:02
  • @HristoIliev, suppose that my program uses 6 MPI processes, each of which has 3 threads. So in total I have 6x3=18 parallel threads. Then what should n be? #BSUB -n 6 or #BSUB -n 18 – cpp_noname Oct 31 '14 at 10:59
  • `-n 6 -R "span[ptile=4]" -x` in order to get 6 slots, have 4 MPI processes per node and have the nodes exclusively. This will result in the following distribution: `4x3+2x3` (MPI processes x threads). Note that this only applies to CPU-intensive threads. On InfiniBand clusters Open MPI spawns two additional threads per process. Those two do not use the CPU much and could therefore share a single core with the main thread and you don't have to request special slots for them. – Hristo Iliev Oct 31 '14 at 13:27