
I'm running LAMMPS simulations on an AMD 2990WX (Ubuntu 18.04).

When I run a single LAMMPS job with mpirun, as in the script below:

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf}

I have no problem and the simulation runs as I expect.

But when I run the script below, each LAMMPS job takes almost exactly three times as long as a single LAMMPS job, so I get no performance gain from running in parallel (three jobs each run at 1/3 the speed of one job).

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}

It is the same without the hostfile my_host. The hostfile is as follows:

    <hostname> slots=32 max-slots=32

I built Open MPI with `--with-cuda`, FFTW with `--enable-shared`, and LAMMPS with a few packages.
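
For reference, a minimal sketch of such a build; only `--with-cuda` and `--enable-shared` are the options named above, while the FFTW install prefix and the LAMMPS package shown are placeholders, not the exact ones used:

    # Open MPI, installed under the MPI_HOME used in the scripts
    ./configure --prefix=/APP/LIBS/OPENMPI2 --with-cuda && make -j && make install

    # FFTW with shared libraries (the install prefix here is a placeholder)
    ./configure --prefix=/APP/LIBS/FFTW3 --enable-shared && make -j && make install

    # LAMMPS, traditional make build with a few packages enabled
    cd /APP/LAMMPS/src
    make yes-manybody      # example package only; the actual packages are not listed above
    make mpi               # builds the MPI binary in the src directory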

I have tried Open MPI v1.8, v3.0, and v4.0 with FFTW v3.3.8. RAM and storage are more than sufficient. I have also checked the load average and per-core usage: the machine uses 24 cores (with the corresponding load) when I run the second script. The same problem occurs when I run duplicates of the first script concurrently in separate terminals (i.e. `sh first.sh` in each terminal).
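
For example, the load and per-core usage can be checked roughly like this (a minimal sketch; `mpstat` comes from the sysstat package, which is an assumption about the setup):

    # Load average: roughly 24 is expected while 3 x 8 MPI ranks are running
    uptime

    # Per-core utilization, 5 samples at 2-second intervals
    mpstat -P ALL 2 5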

Is there a problem with my use of the shell script? Or is there a known issue with mpirun (or LAMMPS) on Ryzen?

Update

I have tested the following script:

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun --cpu-set 0-7 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun --cpu-set 8-15 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun --cpu-set 16-23 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}

And the result shows something like:

    [<hostname>:09617] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09617] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09617] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09617] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09617] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09617] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09617] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09617] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09619] MCW rank 4 bound to socket 0[core 20[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../..]
    [<hostname>:09619] MCW rank 5 bound to socket 0[core 21[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../..]
    [<hostname>:09619] MCW rank 6 bound to socket 0[core 22[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../..]
    [<hostname>:09619] MCW rank 7 bound to socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../..]
    [<hostname>:09619] MCW rank 0 bound to socket 0[core 16[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../..]
    [<hostname>:09619] MCW rank 1 bound to socket 0[core 17[hwt 0-1]]: [../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../..]
    [<hostname>:09619] MCW rank 2 bound to socket 0[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../..]
    [<hostname>:09619] MCW rank 3 bound to socket 0[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 4 bound to socket 0[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 5 bound to socket 0[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 6 bound to socket 0[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 7 bound to socket 0[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 0 bound to socket 0[core 8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 1 bound to socket 0[core 9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 2 bound to socket 0[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../..]
    [<hostname>:09618] MCW rank 3 bound to socket 0[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../..]

I don't have much knowledge about MPI, but to me the bindings do not show anything odd. Is there any problem here?
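
For completeness, the bindings can also be checked mechanically. A minimal sketch, assuming the `--report-bindings` output above is saved to a file named bindings.log (a name chosen just for illustration); empty output means no core is shared between the three jobs:

    # Extract the bound core number from every binding line and report
    # any core number that occurs more than once across all three jobs.
    grep -o 'core [0-9]*' bindings.log | awk '{print $2}' | sort -n | uniq -d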

  • try `mpirun --report-bindings ...` to show how your tasks are bound. You do not want MPI tasks from different "jobs" bound to the same core, and you do not want to end up time-sharing the CPUs or the GPUs. Note that your application's performance might be limited by memory bandwidth and/or file I/O, which could also explain the performance degradation when running more than one job per node. – Gilles Gouaillardet Nov 19 '18 at 01:32
  • @GillesGouaillardet I think memory bandwidth is not a problem (I had no problem when I tested on the other cluster machine). The system used 200 MiB of memory at peak, so memory size is not a problem either. Following your advice I have tested `--report-bindings`; the output is added to the post above. – Ji woong Yu Nov 19 '18 at 05:19
  • From the logs, the cores are not oversubscribed. Does your app use the GPU? Are you saying you ran the same script (e.g. 3 concurrent instances) on another machine and got the expected performance? Is your app writing files? If yes, do the 3 jobs write to the same files? – Gilles Gouaillardet Nov 19 '18 at 06:15
  • @GillesGouaillardet No, the system has a GPU but LAMMPS is not compiled to use it; only the CPU is being used (Open MPI is built with the `--with-cuda` option). On the cluster computer (i.e. the other machine) I have no performance issue; other than the cluster, I haven't tried. Three copies of the same job or three distinct jobs give me the same performance problem, whether the jobs write/read to one file or to separate files. I don't think I/O speed is the issue either: both an NVMe SSD and an HDD give me the same result. – Ji woong Yu Nov 19 '18 at 07:46
  • Do you use the very same submission script on the cluster computer? – Gilles Gouaillardet Nov 19 '18 at 08:12
  • @GillesGouaillardet Actually, it's not the "very" same, since the library directories are in different locations: I have to manually edit the paths to Open MPI and the application (LAMMPS) to run the same script (everything else is the same). Also, on the cluster the libraries were built by the system manager, not by me, so I don't know how they differ from my AMD machine. However, aside from whether such a comparison is strictly fair, I cannot understand why there is a performance drop when the jobs use different cores. – Ji woong Yu Nov 19 '18 at 11:09
  • Are both systems single socket with 24 cores? To me, it seems you are focusing on cores (CPU-bound applications) when most apps are memory-bound. – Gilles Gouaillardet Nov 19 '18 at 11:48
  • @GillesGouaillardet No, the cluster has various configurations, but most of the slave nodes have two-socket CPUs (2*12 = 24 cores). I don't understand what you mean by CPU bound or memory bound. Do you mean the `--bind-to core` option? That's just one of the options I have tested; I don't mind any option as long as my simulation runs as I want. Maybe by 'bound' you mean that the apps are limited by the resources they have. The reason I run three separate jobs is that the performance saturates when 8 cores are used, and I have many parameters to test. – Ji woong Yu Nov 19 '18 at 15:40
  • @GillesGouaillardet But `-np 24` just makes the app run 20% faster, so I decided to run 8 cores * 3 jobs (actually there are more, but this is just a test). – Ji woong Yu Nov 19 '18 at 15:45
  • @GillesGouaillardet Ah, I forgot one thing: the AMD CPU (2990WX) is a single-socket CPU with 32 cores (64 threads), but I intentionally don't use the remaining 8 cores (32 - 8*3 = 8) until the test is done. – Ji woong Yu Nov 19 '18 at 16:04
  • "cpu-bound" means the performance is limited by the number crunching power of one core. "memory-bound" means the performance is limited by the memory bandwidth. If your application is cpu-bound, then running multiple instances on different set of cores should not impact the performance. If your app is memory-bound, and since your workstation is single socket, you end-up sharing the memory bandwidth *and* the shared L3 cache. On a single socket, each app has 1/3 the shared memory cache and bandwidth, but on a dual socket, each app has 2/3 the shared memory cache and bandwidth. – Gilles Gouaillardet Nov 19 '18 at 23:57
  • try running the HPL benchmark (cpu-bound) and the STREAM benchmark (memory-bound). STREAM performance will degrade with 3 jobs per socket, but HPL performance should remain constant even with 3 jobs per socket (and if not, there is definitely something fishy with your setup). – Gilles Gouaillardet Nov 19 '18 at 23:59
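
A minimal sketch of the STREAM test suggested in the last comment, assuming gcc is available, stream.c has been downloaded from the STREAM homepage (the STREAM_ARRAY_SIZE macro follows the stock stream.c), and the core ranges mirror the `--cpu-set` ranges used above:

    #!/bin/sh
    # Build STREAM with OpenMP and an array large enough to exceed the caches.
    gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream

    # Baseline: one instance on cores 0-7.
    OMP_NUM_THREADS=8 taskset -c 0-7 ./stream > stream_single.log

    # Contended: three instances on disjoint core ranges, started together.
    OMP_NUM_THREADS=8 taskset -c 0-7   ./stream > stream_a.log &
    OMP_NUM_THREADS=8 taskset -c 8-15  ./stream > stream_b.log &
    OMP_NUM_THREADS=8 taskset -c 16-23 ./stream > stream_c.log &
    wait

    # Compare the Triad bandwidth: a large drop in the concurrent runs points
    # to memory-bandwidth contention rather than an MPI or LAMMPS problem.
    grep Triad stream_single.log stream_a.log stream_b.log stream_c.log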
