Yes, I have read "SIMD code runs slower than scalar code". No, this is not really a duplicate.
I have been using 2D math for a while, and I am in the process of porting my codebase from C to C++. There are a few walls I've hit with C that mean I really need polymorphism, but that's another story. Anyway, I had considered this a while ago, and the port presented a perfect opportunity to use a 2D vector class, including SSE implementations of the common math operations. Yes, I know there are libraries out there, but I wanted to try it myself to understand what's going on, and I don't use anything more complicated than +=.
My implementation uses <immintrin.h>, with a union along these lines:
    union {
        __m128d ss;
        struct {
            double x;
            double y;
        };
    };
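For readers who want something concrete to compile, here is a minimal sketch of such a class. It is not my actual code: the names Vec2, x() and y() are made up for illustration, and it stores only the __m128d and reads the lanes back through intrinsics, which sidesteps the anonymous-struct-in-a-union extension used above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulled in by <immintrin.h>)

// Hypothetical 2D vector type; Vec2, x() and y() are illustrative names.
struct Vec2 {
    __m128d v;  // low lane holds x, high lane holds y

    // _mm_set_pd takes its arguments as (high, low).
    Vec2(double x, double y) : v(_mm_set_pd(y, x)) {}

    // Packed add: both components are added by a single _mm_add_pd.
    Vec2& operator+=(const Vec2& rhs) {
        v = _mm_add_pd(v, rhs.v);
        return *this;
    }

    // Read the low lane.
    double x() const { return _mm_cvtsd_f64(v); }

    // Move the high lane into the low position, then read it.
    double y() const { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }
};
```

With this, `Vec2 a(1.0, 2.0); a += Vec2(0.5, 0.25);` leaves `a.x()` at 1.5 and `a.y()` at 2.25.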
SSE seemed slow, so I looked at the generated ASM output. After fixing a silly pointer mistake, I ended up with the following sets of instructions, each run a billion times in a loop (the processor is an AMD Phenom II at 3.7GHz):
SSE enabled: 1.1 to 1.8 seconds (varies)
add $0x1, %eax
addpd %xmm0, %xmm1
cmp $0x3b9aca00, %eax
jne 4006c8
SSE disabled: 1.0 seconds (pretty constant)
add $0x1, %eax
addsd %xmm0, %xmm3
cmp $0x3b9aca00, %eax
addsd %xmm2, %xmm1
jne 400630
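The two loops above can be reproduced roughly as follows. This is only a sketch under assumptions: the function names accumulate_packed, accumulate_scalar and seconds are invented here, the iteration count is a parameter rather than a hard-coded billion, and the compiler is free to generate different instructions than the listings show. Note that in both versions each add depends on the result of the previous iteration, so a test like this may be measuring add latency rather than throughput.

```cpp
#include <chrono>
#include <emmintrin.h>  // SSE2 intrinsics

// Packed version: one packed add per iteration updates both lanes.
// (Expected to become a single addpd in the loop body, but check the ASM.)
inline __m128d accumulate_packed(long n, __m128d step) {
    __m128d acc = _mm_setzero_pd();
    for (long i = 0; i < n; ++i) {
        acc = _mm_add_pd(acc, step);
    }
    return acc;
}

// Scalar version: two separate adds per iteration, one per component.
// The x and y accumulators do not depend on each other.
inline void accumulate_scalar(long n, double sx, double sy,
                              double& x, double& y) {
    x = 0.0;
    y = 0.0;
    for (long i = 0; i < n; ++i) {
        x += sx;
        y += sy;
    }
}

// Wall-clock time in seconds taken by calling f() once.
template <class F>
double seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Something like `seconds([&] { r = accumulate_packed(1000000000L, step); })` then gives a time for the packed loop; the result `r` should be used afterwards (printed, for instance) so the loop is not optimized away.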
The only conclusion I can draw from this is that addsd is faster than addpd, and that pipelining lets the two faster instructions partially overlap, which compensates for the extra instruction.
So my question is: is this worth it, and in practice will it actually help, or should I just not bother with the stupid optimization and let the compiler handle it in scalar mode?