I have a new MacBook Pro with the Apple M1 Max processor (10 cores total), running macOS 12.2.1. I used Homebrew to install gcc:
~/homebrew/bin/gcc-11 --version
gcc-11 (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This package came with gfortran:
gfortran --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
It also came with mpifort:
mpifort --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I have a Fortran code that uses MPI along with OpenMP. It works well and has been used on various Linux boxes and on a supercomputer. While benchmarking the new laptop, I noticed that the overall speed of my code depends on the combination of the number of MPI tasks (np) and OpenMP threads:
np   OMP_NUM_THREADS   wall time (sec)   loop time (sec)
--------------------------------------------------------
 1          8               2731             299.906
 2          4               1816             194.753
 4          2               1424             156.876
 8          1               1415             156.372
In all cases, a total of 8 cores were used. This particular test had a large loop, executed 9 times. The pure-OpenMP run (np = 1) is almost a factor of 2 slower than the pure-MPI run (np = 8). I have done the same test on a Linux box (AMD Ryzen Threadripper), and there the execution time was essentially unchanged across combinations of np and OMP_NUM_THREADS with the product np*OMP_NUM_THREADS held constant.
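For reference, each row in the table corresponds to a launch along these lines (./bench is just a placeholder for the binary; the first row would use the OpenMP-only build, the rest the mpifort build):

OMP_NUM_THREADS=8 ./bench
OMP_NUM_THREADS=4 mpirun -np 2 ./bench
OMP_NUM_THREADS=2 mpirun -np 4 ./bench
OMP_NUM_THREADS=1 mpirun -np 8 ./bench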
My compile commands are
gfortran -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384
for the OpenMP-only code, and
mpifort -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384
for the MPI hybrid code. Are there compiler flags I could use to speed up the OpenMP-only version? I have a lot of related OpenMP codes that have not yet been modified to work with MPI, so it would be nice if some compiler tweaks could help.
On the other hand, is this a case of gfortran+OpenMP support on Apple M1 needing work at a deeper level than anything I can fix myself?
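For context, the hot loop has roughly the structure below. This is a simplified, hypothetical sketch (the placeholder workload, sizes, and names are all made up; the real code is much larger):

program hybrid_bench
  use mpi
  implicit none
  integer :: ierr, rank, nranks, provided
  integer :: i, iter
  integer, parameter :: n = 100000000   ! assumed divisible by nranks
  real(8) :: local_sum, global_sum, t0, t1

  ! Request threaded MPI; the OpenMP threads live inside each rank
  call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

  do iter = 1, 9                        ! the large loop, executed 9 times
     t0 = MPI_Wtime()
     local_sum = 0.0d0
     ! Each rank owns a contiguous slice; OpenMP splits it across threads
     !$omp parallel do reduction(+:local_sum)
     do i = rank*(n/nranks) + 1, (rank+1)*(n/nranks)
        local_sum = local_sum + sin(dble(i))   ! placeholder work
     end do
     !$omp end parallel do
     call MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, MPI_COMM_WORLD, ierr)
     t1 = MPI_Wtime()
     if (rank == 0) print '(a,i2,a,f10.3,a)', 'iter ', iter, ': ', t1-t0, ' s'
  end do

  call MPI_Finalize(ierr)
end program hybrid_bench

Since each rank owns a contiguous slice of the iteration space and OpenMP splits that slice across its threads, every (np, OMP_NUM_THREADS) combination with np*OMP_NUM_THREADS = 8 performs the same total work.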