I have a large library with a mix of regular C++, a lot of SSE intrinsics, and a few insignificant pieces of assembly. I have reached the point where I would like to target the AVX instruction set. To do that, I would like to build the whole thing with GCC's -mavx or MSVC's /arch:AVX, so I can add AVX intrinsics wherever they are needed and not have to worry about AVX state transitions internally.
The only problem I've found with that is the standard C math functions: sin(), exp(), etc. Their implementations on Linux systems use SSE instructions without the VEX prefix. I have not checked, but I expect a similar problem on Windows.
The code makes a fair number of calls to math functions. Some quick benchmarking reveals that a simple call to sin() gets either slightly (~10%) slower or much (3x) slower, depending on the exact CPU and how it handles the AVX state transitions (Skylake vs. older).
Adding a VZEROUPPER before the call helps pre-Skylake CPUs a lot, but actually makes the code a little slower on Skylake. It seems like the proper solution would be VEX-encoded versions of the math functions.
So my question is this: Is there a reasonably efficient math library which can be compiled to use VEX encoded instructions? How do others deal with this problem?