
Is there any difference in precision or performance between the normal `sqrtps`/`sqrtpd` intrinsics and the SVML versions?

     __m128d _mm_sqrt_pd (__m128d a) [SSE2]
     __m128d _mm_svml_sqrt_pd (__m128d a) [SSE?]
     __m128 _mm_sqrt_ps (__m128 a) [SSE]
     __m128 _mm_svml_sqrt_ps (__m128 a) [SSE?]

I know that SVML intrinsics like `_mm_sin_ps` are actually functions consisting of potentially multiple asm instructions, so they should be slower than any single multiply or even divide. However, I'm curious why these functions exist if hardware-level intrinsics are available.

Were these SVML functions created before SSE2? Or is there a difference in precision?

dave_thenerd
  • If you have a compiler that supports these, you could use one in a `__m128 foo(__m128 v)` wrapper and see if it inlines to `sqrtps xmm0,xmm0` / `ret` or not. IDK what the point of those SVML versions would be. – Peter Cordes Sep 28 '21 at 01:05
  • Hard to imagine they'd provide a `__m128d` type and a sqrt for it on a CPU without SSE2. Most compilers don't even define `__m128d` if SSE2 isn't enabled (try a 32-bit build with `-mno-sse2`), and in compilers like ICC and MSVC, using `__m128d` implies using SSE2. It would also be pretty pointless to have SIMD vector support for doing emulated double-precision sqrts but not add / sub / mul / div, and being emulated it would be slower than just using x87 for the 2 elements separately. So I think we can rule out that guess. – Peter Cordes Sep 28 '21 at 01:08

1 Answer


I've inspected the code gen in MSVC.

  • _mm_svml_sqrt_pd compiles into a function call; the called function consists of a single sqrtpd followed by ret
  • _mm_svml_sqrt_ps compiles into a function call; the called function consists of a single sqrtps followed by ret
  • _mm_sqrt_pd and _mm_sqrt_ps intrinsics compile to inlined sqrtpd and sqrtps

A possible explanation (just a guess): SVML was intended to have CPU dispatch, but the version shipped with MSVC has that dispatch disabled. The goal may have been to implement these functions differently for Xeon Phi; the Xeon Phi version may not be included in the MSVC build of SVML.


Screenshot: (disassembly screenshot omitted)


When using the Intel compiler, it links against svml_dispmd.dll, and there is an actual dispatch function (a real indirect jump, `ff 25 42 08 00 00`), which ends up in `vsqrtpd` for me.

Alex Guteniev
  • Is the indirect function call just the normal DLL mechanism because the code is in the SVML DLL? I assume you tested on a machine with AVX, and it still ran `sqrtpd` not `vsqrtpd`, so it sounds like it really is just that dumb and they never should have provided these functions. – Peter Cordes Oct 02 '21 at 18:36
  • Hmm, I wonder if a Xeon Phi version could use AVX512ER [`vrsqrt28ps`](https://www.felixcloutier.com/x86/vrsqrt28ps) to get an approximation that doesn't need a Newton-Raphson iteration for single-precision? IIRC, actual sqrt instructions are quite slow on KNL and you're intended to use AVX512ER with [`vfixupimmps`](https://www.felixcloutier.com/x86/vfixupimmps) to handle cases like 0. – Peter Cordes Oct 02 '21 at 18:39
  • @PeterCordes, the functions are not in a DLL; even though I compile with `/MD`, so that the C run-time is in a DLL, they are statically linked into the exe. I may have used the term _indirect_ incorrectly: the function call is `e8 af 0c 00 00`, then at the target address there's a jump `e9 0b 00 00 00`, and at that address there's the actual implementation. – Alex Guteniev Oct 02 '21 at 18:53
  • Ah ok, that's not exactly "indirect call" like you'd get for DLL dynamic linking. (Can that resolve at dynamic link time based on CPU features, the way Linux dynamic linking can? That's how glibc resolves memcmp etc. to versions for AVX2 CPUs or whatever without any per-call overhead above what dynamic linking already imposes.) `call rel32` to a `jmp rel32` is more like a Linux PLT stub / wrapper (although a PLT normally uses a `jmp [mem]`). IDK if SVML could rewrite that `jmp` based on CPU features; can you tell if it's in a memory page that it might remap read/write at some point? – Peter Cordes Oct 02 '21 at 18:57
  • Unless you had an actual Xeon Phi, you wouldn't have AVX512ER. No mainstream (Skylake / Ice Lake) CPUs have AVX512ER (exponential / reciprocal). I wonder how much of pain it would be to run your binary on a virtual KNL, like with Qemu or SDE... If it's going to do any dynamic dispatching on any CPU, I'd guess it might do it there. (Especially for a `__m512` version if there is one.) – Peter Cordes Oct 02 '21 at 18:58
  • I have a usual Coffee Lake laptop CPU, an i7-8750H. I've tried compiling with `/arch:AVX2`, and in 32-bit mode with `/arch:IA32` — no difference. I'm starting to suspect it doesn't do any dispatching by rewriting the `jmp`, and it's all just pointless... – Alex Guteniev Oct 02 '21 at 19:07
  • There's no reason why it would rewrite the `jmp` on that CPU. External functions get called with the CPU in clean-upper state so `vsqrtps` would have no advantage (unless something violated that). It's plausible it would rewrite it on a Xeon Phi, where `sqrtps` is much slower (18 uops, 38c latency). But could easily just be pointless, especially since it's a direct jmp, not `jmp [mem]` through a pointer which would pretty definitely imply dispatch. – Peter Cordes Oct 02 '21 at 19:10
  • 1
    Maybe the point is indeed doing a different encoding for Xeon Phi, but since Visual Studio does not attempt to target Xeon Phi, the SVML library version shipped with it has the dispatch omitted, while rudimentary dispatch functions are still present, so I see this relative jmp; I've added a screenshot to the answer – Alex Guteniev Oct 02 '21 at 19:15
  • That might be possible, yeah, if SVML's own source code had a dispatch function that optimized away to a tailcall to the SSE2 version when built with some #define config options. And instead of inlining the sqrt/ret or giving both symbols the same address, we got an actual function that just compiled to a `jmp`. That's what compilers normally do for non-inline function calls if you don't do something special; I guess they don't like letting 2 functions have the same address? Or it's a missed optimization that they emit a real function that tailcalls instead of using a symbol alias. – Peter Cordes Oct 02 '21 at 19:17
  • @AlexGuteniev Wow you really went all out. Thanks massively for your investigative work. – dave_thenerd Oct 03 '21 at 21:31
  • I recalled that I have access to Intel Compiler 16.0. Tried it -- the result is different. Apparently the MSVC version is somewhat down-level. – Alex Guteniev Oct 04 '21 at 16:01
  • Just saw your last edit. Yeah, almost certainly a dispatch function that optimized down to an unconditional tailcall rather than a symbol alias. Still pointless for almost every use-case, though, except possibly on Xeon Phi. – Peter Cordes Oct 04 '21 at 16:02