I have a new MacBook Pro with the Apple M1 Max processor (10 cores total), running macOS 12.2.1. I used Homebrew to install gcc:

~/homebrew/bin/gcc-11 --version
gcc-11 (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

This package came with gfortran:

gfortran --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I also have mpifort, which wraps the same gfortran:

mpifort --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I have a Fortran code that uses MPI along with OpenMP. It works well and has been used on various Linux boxes and on a supercomputer. While benchmarking the new laptop, I noticed that the overall speed of my code depends on the combination of the number of MPI tasks (np) and OpenMP threads (OMP_NUM_THREADS):

np     OMP_NUM_THREADS     wall time    loop time
                           (sec)        (sec)
--------------------------------------------------
1      8                   2731         299.906
2      4                   1816         194.753
4      2                   1424         156.876   
8      1                   1415         156.372

In all cases, a total of 8 cores were used. This particular test had a large loop that was executed 9 times. The pure-OpenMP run is almost a factor of 2 slower than the pure-MPI run. I have done the same test on a Linux box (AMD Ryzen Threadripper), and there was essentially no change in execution times across the various combinations of np and OMP_NUM_THREADS where the product np*OMP_NUM_THREADS is constant.
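For context, here is a stripped-down sketch of the structure: an outer loop repeated 9 times, with the index range split across MPI ranks and the inner work shared among OpenMP threads. The kernel is a placeholder (a trig sum), not my real computation, and all names are made up for illustration:

program hybrid_bench
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: n = 20000000   ! placeholder problem size
  integer :: ierr, rank, nprocs, i, rep
  real(8) :: local_sum, total_sum, t0, t1
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  if (rank == 0) print *, 'MPI tasks:', nprocs, ' OpenMP threads:', omp_get_max_threads()
  local_sum = 0.0d0
  t0 = MPI_Wtime()
  do rep = 1, 9                        ! the real test executes its big loop 9 times
     !$omp parallel do reduction(+:local_sum)
     do i = rank + 1, n, nprocs        ! MPI ranks stride over the index range
        local_sum = local_sum + sin(dble(i))**2   ! placeholder kernel
     end do
     !$omp end parallel do
  end do
  t1 = MPI_Wtime()
  call MPI_Reduce(local_sum, total_sum, 1, MPI_DOUBLE_PRECISION, &
                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)
  if (rank == 0) print '(a,f10.3,a,es14.6)', 'loop time:', t1 - t0, ' s, sum =', total_sum
  call MPI_Finalize(ierr)
end program hybrid_bench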

My compile command is

gfortran -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384

for OpenMP only, and

mpifort -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384

for the MPI hybrid code. Are there compiler flags for the OpenMP version I could use to speed things up? I have a lot of related OpenMP codes that have not yet been modified to work with MPI, so it would be nice if some compiler tweaks could help.
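To be concrete, this is the kind of variation I have in mind (the file and binary names are placeholders, and I do not know whether any of this helps on the M1):

# try -O3 instead of -Ofast, plus link-time optimization
gfortran -O3 -flto -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384 main.f90 -o bench_omp

# keep idle OpenMP threads spinning instead of sleeping (standard OpenMP env var)
OMP_WAIT_POLICY=active OMP_NUM_THREADS=8 ./bench_omp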

On the other hand, is this a case of gfortran+OpenMP for Apple M1 needing more work at a deeper level than what I can do?

  • Welcome, I suggest taking the [tour] and perhaps also reading [ask]. It would be good to see the code. Otherwise it is almost impossible to say anything specific. OpenMP can have memory locality and false sharing issues that are easier to prevent in MPI (where memory is distributed). I am not too surprised that MPI is faster for you. Are those cores real physical cores with separate FPUs? What exact CPU model do you use? I am afraid not much can be said without the code. – Vladimir F Героям слава Feb 23 '22 at 16:08
  • Thanks for the response. The source code files have a total of 136,916 lines. Not all of these lines are code of course, but still it is large. As for the Apple M1, I believe there are 8 "high performance" cores and 2 "efficiency" cores. As far as I know they are physical. I could see how RAM issues come into play here. I was just surprised to see how much faster MPI is than OpenMP on the M1, whereas there is essentially no difference for the AMD. – Jerome Orosz Feb 23 '22 at 16:18
  • If your code depends on cache behavior, make sure to set `OMP_PROC_BIND=true` and `OMP_PLACES=cores`. – Victor Eijkhout Feb 23 '22 at 17:04
  • I set OMP_PROC_BIND=true and OMP_PLACES=cores (exported as in the snippet after these comments) and saw no difference overall. I have at best a feeble understanding of hardware, but the Apple M1 chips apparently have a rather different arrangement for the memory. OpenMP is shared memory and MPI is distributed memory, so perhaps it should not be surprising to see differences in run times. However, a factor of 2 difference really seems large to me. – Jerome Orosz Feb 23 '22 at 17:18
  • Ideally you could provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). In this case it will not be quite as reproducible because not everyone has an M1, but at least if it is an issue with your code, it could be found. If this is a problem with the gcc implementation of OpenMP for M1, it won't be that helpful, I guess. – paleonix Feb 23 '22 at 18:35
  • give LLVM (provided by `brew`) a try, GNU OpenMP runtime is syscall based on Linux, and it can have a pretty high overhead on some workflows (never tried on M1 though). Unlike Linux, I do not think OSX allows an app to pin threads (or processes) to a specific (set of) core. That should not make a difference between MPI and OpenMP though. – Gilles Gouaillardet Feb 24 '22 at 11:33
  • I have a similar problem (C++ with OpenMP) on Apple Silicon, possibly related: https://stackoverflow.com/questions/73591488/c-code-compiled-using-homebrews-clang-on-macos-apple-silicon-runs-significant – Stefan Sep 03 '22 at 14:29
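
For reference, the affinity settings discussed in the comments above can be exported like this before launching the OpenMP binary (./bench_omp is a placeholder name):

export OMP_PROC_BIND=true    # enable thread binding
export OMP_PLACES=cores      # one place per physical core
OMP_NUM_THREADS=8 ./bench_omp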

0 Answers