
I have this large library with a mix of regular C++, a lot of SSE intrinsics and a few insignificant pieces of assembly. I have reached the point where I would like to target the AVX instruction set.

To do that, I would like to build the whole thing with gcc's -mavx or MSVC's /arch:AVX so I can add AVX intrinsics wherever they are needed and not have to worry about AVX state transitions internally.

The only problem I've found with that is the standard C math functions: sin(), exp(), etc. Their implementation on Linux systems uses SSE instructions without the VEX prefix. I have not checked, but I expect a similar problem on Windows.

The code makes a fair number of calls to math functions. Some quick benchmarking reveals that a simple call to sin() gets either slightly (~10%) slower or much (3x) slower depending on the exact CPU and how it handles the AVX transitions (Skylake vs. older).

Adding a VZEROUPPER before the call helps pre-Skylake CPUs a lot but actually makes the code a little slower on Skylake. It seems like the proper solution would be a VEX-encoded version of the math functions.

So my question is this: Is there a reasonably efficient math library which can be compiled to use VEX encoded instructions? How do others deal with this problem?

Peter Cordes
Olivier
  • They did not re-implement the math functions in AVX. There was no point, they can do that in the processor design to improve the existing instruction. – Hans Passant Mar 26 '18 at 17:26
  • 4
    @HansPassant: I do not understand your point. Who is “they”? The first sentence seems to be about people who write the math library. The second sentence seems to be about people who design processors. Those are generally different people. And I do not see how improving existing instructions would preclude implementing the math library in AVX or at least making provisions for avoiding AVX-SSE transition penalties. – Eric Postpischil Mar 26 '18 at 17:46
  • @HansPassant: I don't think there's any indication that Intel is planning to remove SSE/AVX transition penalties entirely in the future, in their mainstream CPUs. (KNL is different: Agner Fog says there's a P6-style partial register *stall* if an SSE instruction writes an XMM register, and then you read the YMM or ZMM register. So you will have a bad time only if you do something like a ZMM shuffle to merge data into the high elements of an SSE function result. `vmovaps xmm1, xmm0` should avoid a problem by making a zero-extended copy.) But anyway, KNL is totally different from SKL. – Peter Cordes Mar 26 '18 at 21:17
  • Shouldn't it be relatively easy to download (e.g.) [glibc](https://www.gnu.org/software/libc/) and compile it with AVX support? – chtz May 02 '18 at 20:14
  • @chtz relatively is the key word. I'd have to disable the assembly parts. And I doubt it would be easy to build for windows. It's also not that fast to begin with. The LGPL adds more trouble too. I would consider it a last resort if nothing else can be found but I was hoping someone had already hit that problem and done some of the work. – Olivier May 04 '18 at 02:58
  • Also musl libc appears to have code which would be easy to build but the core algorithms are 25 years old so I doubt most of the choices made then are good for today's CPUs. – Olivier May 04 '18 at 03:10
  • Another alternative would be [cephes](http://www.netlib.org/cephes/). This claims to be written in pure C. But that also is quite old and has a very vague [license](http://www.netlib.org/cephes/readme). – chtz May 04 '18 at 09:14

0 Answers