I'm working on a project to add AMD BLIS to a product that currently uses MKL and Intel OpenMP.
While some test cases show an improvement, others are a lot worse.
After profiling, I see the AMD version spending more time in gomp barrier and pthread functions than the Intel version spends in iomp kmp functions.
I don't have much experience with OpenMP, and I was wondering whether the build options used for OpenMP might have much impact. This is with a locally built GCC 11.2, which uses:
GNU C17 11.2.0 -mtune=generic -march=x86-64 -g -O2 -ftls-model=initial-exec
Does gomp have any -march-specific optimizations to speed up barriers?
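
To separate the barrier cost from BLIS itself, a small microbenchmark along these lines might help. This is only a sketch under assumptions: `barrier_cost_ns` is a hypothetical helper name, the timing uses POSIX `clock_gettime`, and the result includes the one-off cost of starting the parallel region, so it is only meaningful for large iteration counts.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical helper: rough average cost, in nanoseconds, of one
 * OpenMP barrier across the default thread team.
 * Assumes iters > 0. The measurement also includes the start-up and
 * tear-down of the parallel region, so use a large iters value.
 * If built without -fopenmp the pragmas are ignored and this only
 * times an empty serial loop. */
static double barrier_cost_ns(long iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    #pragma omp parallel
    {
        /* Every thread in the team hits the same barrier iters times. */
        for (long i = 0; i < iters; i++) {
            #pragma omp barrier
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Calling this from a small `main` with a large iteration count, built with `gcc -O2 -fopenmp`, and running it with the same thread count and affinity settings as the real workload should show whether the barrier itself is slow on this machine, independently of BLIS.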