
The results show that as I increase the number of MPI processes from 2 to 4 to 10, the runtime decreases each time, but when I go to 20 processes there is a large increase in runtime. Each node has two 8-core processors, so I want to limit each node to 16 MPI processes. Am I doing this correctly? I suspect the problem has to do with my sbatch file, especially since the large increase in runtime occurs when I go from using one node to two. Here is my sbatch file:

#!/bin/bash -x
#SBATCH -J scalingstudy
#SBATCH --output=scalingstudy.%j.out
#SBATCH --error=scaling-err.%j.err
#SBATCH --time=03:00:00
#SBATCH --partition=partition_name
#SBATCH --mail-type=end
#SBATCH --mail-user=email@school.edu

#SBATCH -N 2
#SBATCH --ntasks-per-node=16

module load gcc/4.9.1_1
module load openmpi/1.8.1_1

mpic++ enhanced_version.cpp

mpirun -np 2 ./a.out 10000
mpirun -np 4 ./a.out 10000
mpirun -np 10 ./a.out 10000
mpirun -np 20 --bind-to core ./a.out 10000

mpirun -np 2 ./a.out 50000
mpirun -np 4 ./a.out 50000
mpirun -np 10 ./a.out 50000
mpirun -np 20 --bind-to core ./a.out 50000

mpirun -np 2 ./a.out 100000
mpirun -np 4 ./a.out 100000
mpirun -np 10 ./a.out 100000
mpirun -np 20 --bind-to core ./a.out 100000

mpirun -np 2 ./a.out 500000
mpirun -np 4 ./a.out 500000
mpirun -np 10 ./a.out 500000
mpirun -np 20 --bind-to core ./a.out 500000

mpirun -np 2 ./a.out 1000000
mpirun -np 4 ./a.out 1000000
mpirun -np 10 ./a.out 1000000
mpirun -np 20 --bind-to core ./a.out 1000000
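
One way to make the placement of the 20-rank runs explicit, instead of relying on the default mapper, is sketched below; it assumes the same two-node allocation as above. --map-by ppr:10:node asks Open MPI 1.8 to put exactly 10 ranks on each node, and --report-bindings prints the resulting layout to stderr (so it ends up in the scaling-err.%j.err file) for verification. The even 10+10 split is an assumed layout, not something the script above enforces.

# Hypothetical variant of the 20-rank runs: 10 ranks per node, with the
# bindings reported so the placement can be checked in the error file.
mpirun -np 20 --map-by ppr:10:node --bind-to core --report-bindings ./a.out 10000
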
  • Did you try the `-ppn` flag? That gives you a way to specify how many ranks will go on each node. Your problem might be that you're putting 8 ranks on one node and two on the other (depending on the default mapper). – Wesley Bland May 18 '15 at 15:52
  • @WesleyBland, `#SBATCH --ntasks-per-node=16` is SLURM's equivalent to `-ppn` – Hristo Iliev May 18 '15 at 16:09
  • Right, so Learning_MPI might want to make that number smaller when using <32 processes. – Wesley Bland May 18 '15 at 16:23
  • I doubt that anyone could answer your question in its current state. It could be due to improper timing methods, a badly written algorithm, or high sensitivity to latency... What does `a.out` do? What constitutes a "large increase in runtime"? Why are you only using `--bind-to core` in the 20-process case but not in the others (though Open MPI 1.8.x binds to core by default)? – Hristo Iliev May 19 '15 at 08:04
  • I talked to my research advisor and he believes the increase in communication overhead when using multiple nodes is due to messages having to go over the interconnect between nodes rather than just being copied through shared memory on a single node. @Hristo Iliev, I guess having --bind-to core is redundant since I already have --ntasks-per-node? I just wanted to make sure that, since I was going over 16 tasks, only one task ran on each core (see the comparison sketch after these comments). – Learning_MPI May 21 '15 at 16:20
  • `--bind-to core` instructs Open MPI to bind (pin) each MPI process to a different CPU core. It has nothing to do with the `--ntasks-per-node` option of SLURM, though SLURM could be configured to perform the binding itself. Anyway, Open MPI binds the processes by default in version 1.8 so the option changes nothing. – Hristo Iliev May 23 '15 at 16:25
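
A simple way to test the interconnect hypothesis from the comments above is to run the same number of ranks in two layouts and compare runtimes: once packed onto a single node, where all communication goes through shared memory, and once split across both nodes, where some of it has to cross the interconnect. This is only a sketch under the same two-node allocation; the ppr values are chosen purely to force the two layouts.

# All 16 ranks on one node: communication stays in shared memory.
mpirun -np 16 --map-by ppr:16:node --bind-to core ./a.out 1000000
# 8 ranks per node: some of the communication now crosses the interconnect.
mpirun -np 16 --map-by ppr:8:node --bind-to core ./a.out 1000000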

0 Answers