Is there any difference in precision or performance between the normal sqrtps/sqrtpd intrinsics and their SVML versions:
__m128d _mm_sqrt_pd (__m128d a) [SSE2]
__m128d _mm_svml_sqrt_pd (__m128d a) [SSE?]
__m128 _mm_sqrt_ps (__m128 a) [SSE]
__m128 _mm_svml_sqrt_ps (__m128 a) [SSE?]
I know that SVML intrinsics like _mm_sin_ps are actually functions consisting of potentially multiple asm instructions, so they should be slower than any single multiply or even divide. However, I'm curious why these functions exist when hardware-level intrinsics are available.
Were these SVML functions created before SSE2? Or is there a difference in precision?
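
To make the precision half of the question concrete, here is a rough sketch of how the two results could be compared. The helper names sqrt_pd_array and ulp_distance are made up for illustration and are not part of any library. Since _mm_svml_sqrt_pd needs a compiler that ships SVML, the sketch only calls _mm_sqrt_pd and leaves the second result as something the caller supplies. It assumes an even element count and finite, non-negative values.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_sqrt_pd, _mm_storeu_pd */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: square root of n doubles using the hardware
   sqrtpd instruction, two elements per iteration. n must be even. */
void sqrt_pd_array(const double *in, double *out, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)
        _mm_storeu_pd(out + i, _mm_sqrt_pd(_mm_loadu_pd(in + i)));
}

/* Hypothetical helper: distance in ULPs between two doubles.
   Only meaningful for finite, non-negative inputs, where the raw
   bit patterns are ordered the same way as the values. */
uint64_t ulp_distance(double a, double b)
{
    uint64_t ua, ub;
    memcpy(&ua, &a, sizeof ua);  /* reinterpret bits without aliasing issues */
    memcpy(&ub, &b, sizeof ub);
    return ua > ub ? ua - ub : ub - ua;
}
```

Running the same inputs through the SVML version and taking the largest ulp_distance between the two output arrays would show whether the results ever differ.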