4

Yes, I read SIMD code runs slower than scalar code. No, it's not really a duplicate.

I have been doing 2D math work for a while, and am in the process of porting my codebase from C to C++. There are a few walls I've hit with C that mean I really need polymorphism, but that's another story. Anyway, I considered this a while ago, and the port presented a perfect opportunity to use a 2D vector class, including SSE implementations of the common math operations. Yes, I know there are libraries out there, but I wanted to try it myself to understand what's going on, and I don't use anything more complicated than +=.

My implementation is via <immintrin.h>, with a

union {
    __m128d ss;
    struct {
        double x;
        double y;
    };
};
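Fleshed out, such a wrapper might look like the sketch below. The `vec2` name and the constructors are my own additions; only `operator+=` is taken from the question.

```cpp
#include <immintrin.h>

// Sketch of the union-based wrapper described above (names hypothetical).
// The anonymous struct lets x/y alias the two lanes of the __m128d.
union vec2 {
    __m128d ss;
    struct { double x, y; };

    vec2() : ss(_mm_setzero_pd()) {}
    vec2(double x_, double y_) : ss(_mm_set_pd(y_, x_)) {}  // high lane first

    vec2& operator+=(const vec2& o) {
        ss = _mm_add_pd(ss, o.ss);  // one packed add for both components
        return *this;
    }
};
```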

SSE seemed slow, so I looked at the generated ASM output. After fixing something stupid pointer-wise, I ended up with the following sets of instructions, run a billion times in a loop (processor is an AMD Phenom II at 3.7GHz):

SSE enabled: 1.1 to 1.8 seconds (varies)

add      $0x1, %eax
addpd    %xmm0, %xmm1
cmp      $0x3b9aca00, %eax
jne      4006c8

SSE disabled: 1.0 seconds (pretty constant)

add      $0x1, %eax
addsd    %xmm0, %xmm3
cmp      $0x3b9aca00, %eax
addsd    %xmm2, %xmm1
jne      400630

The only conclusion I can draw from this is that addsd is faster than addpd, and that pipelining means the extra instruction is compensated for by the ability to overlap more of the independent work.
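For reference, the timed loops correspond roughly to this sketch; the loop count is parameterized here, but the `cmp $0x3b9aca00` in the disassembly is 1,000,000,000. The helper name and the `step` value are invented for illustration.

```cpp
#include <immintrin.h>

// Rough reconstruction of the benchmark: add a constant __m128d to an
// accumulator n times (the disassembly runs this 10^9 times).
__m128d run_loop(__m128d step, long n) {
    __m128d acc = _mm_setzero_pd();
    for (long i = 0; i < n; ++i)
        acc = _mm_add_pd(acc, step);  // the addpd in the SSE version
    return acc;
}
```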

So my question is: is this worth it, and in practice will it actually help, or should I just not bother with the stupid optimization and let the compiler handle it in scalar mode?

zebediah49
  • Off topic: run these calculations on the GPU and they become much faster. – Evgen Bodunov Jun 21 '12 at 20:49
  • Tried that, unfortunately I cannot find a way to parallelize a loop of the form `sums[(int)(list[i].index)]+=list[i].data` for array sizes too large to fit in GPU thread memory. If I can't do it 100% on the GPU, the repeated memcopy-to-GPU destroys any performance I might have had. – zebediah49 Jun 21 '12 at 20:55
  • Most AMD CPUs only have a 64-bit execution unit for SSE, so 128-bit SSE instructions actually require two micro-ops (i.e. 2 clock cycles). This used to be true on Intel CPUs until about 5 years ago too. Try running your code on a modern Intel CPU, e.g. Core i7, and see if that helps. – Paul R Jun 21 '12 at 21:06
  • @PaulR Phenom II is a K10, it doesn't split up 128bit instructions in half (except memory write, which is still split). This was my first thought too, but it must be something else. – harold Jun 21 '12 at 23:07
  • @EvgenBodunov Dubious. Regular computation on the GPU only wins if the computation is big enough to cover the data transfer, which I don't think is the case here. SIMD+OpenMP is still more efficient than the GPU on simple cases like this one. – Joel Falcou Jun 22 '12 at 07:02
  • @JoelFalcou we'll never know without seeing the code. – Evgen Bodunov Jun 22 '12 at 07:08
  • @EvgenBodunov I was taking for granted that it was merely a +. But of course, we need the code. – Joel Falcou Jun 22 '12 at 10:49

3 Answers

7

This requires more loop unrolling, and maybe cache prefetching. Your arithmetic density is very low: 1 operation per 2 memory operations, so you need to jam as many of these into your pipeline as possible.

Also, don't use a union; use __m128d directly, and use _mm_load_pd to fill your __m128d from your data. __m128d in a union generates bad code, where every element does a stack-register-stack dance, which is detrimental.
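A minimal sketch of that approach (the function name and signature are my own; it assumes 16-byte aligned arrays and an even element count):

```cpp
#include <immintrin.h>

// Keep plain doubles in memory; lift them into __m128d registers only
// around the arithmetic, as suggested above. Swap in _mm_loadu_pd /
// _mm_storeu_pd if alignment can't be guaranteed.
void add_pairs(double* dst, const double* a, const double* b, int n) {
    for (int i = 0; i < n; i += 2) {        // n assumed even
        __m128d va = _mm_load_pd(a + i);    // aligned 128-bit load
        __m128d vb = _mm_load_pd(b + i);
        _mm_store_pd(dst + i, _mm_add_pd(va, vb));
    }
}
```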

Joel Falcou
  • In the actual implementation, there are enough of these things that they need to be allocated on the heap anyway. Nonetheless, that is a good point. – zebediah49 Jun 21 '12 at 20:57
  • Usually you don't allocate __m128* types. You allocate the underlying scalar types and _mm_load_* them. – Joel Falcou Jun 22 '12 at 10:49
2

Just for the record, Agner Fog's instruction tables confirm that K10 runs addpd and addsd with identical performance: 1 m-op for the FADD unit, with 4-cycle latency. The earlier K8 only had 64-bit execution units, and split addpd into two m-ops.

So both loops have a 4 cycle loop-carried dependency chain. The scalar loop has two separate 4c dep chains, but that still only keeps the FADD unit occupied half the time (instead of 1/4).

Other parts of the pipeline must be coming into play, perhaps code alignment or just instruction ordering. AMD is more sensitive to that than Intel, IIRC. I'm not curious enough to read up on the K10 pipeline and figure out if there's an explanation in Agner Fog's docs.

K10 doesn't fuse cmp/jcc into a single op, so having them split up isn't actually a problem. (Bulldozer-family CPUs do, and of course Intel does).
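The scalar loop's advantage, two independent dependency chains, can be given to the packed version as well by unrolling with multiple accumulators. A sketch (names hypothetical, even element count assumed):

```cpp
#include <immintrin.h>

// Two independent accumulators break the loop-carried dependency:
// each addpd chain only has to produce a result every other iteration,
// hiding the 4-cycle FADD latency.
__m128d sum_unrolled(const __m128d* v, int n) {  // n assumed even
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    for (int i = 0; i < n; i += 2) {
        acc0 = _mm_add_pd(acc0, v[i]);      // chain 0
        acc1 = _mm_add_pd(acc1, v[i + 1]);  // chain 1
    }
    return _mm_add_pd(acc0, acc1);          // combine once at the end
}
```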

Peter Cordes
1

2D math isn't that processor intensive (compared to 3D math) so I highly doubt it's worth sinking that much time into it. It's worth optimizing if

  1. Your profiler says the code is a hot spot.
  2. Your code is running slowly. (I imagine this is for a game?)
  3. You've already optimized the high-level algorithms.

I've done some SSE tests on my rigs (AMD APU @ 3GHz x 4; old Intel CPU @ 1.8GHz x 2) and have found SSE to be of benefit in most of the cases I've tested. However, this was for 3D operations, not 2D.

The scalar code has more of an opportunity for parallelism, IIRC: four registers used instead of two, and fewer dependencies. If register contention becomes greater, the vectorized code may run better. Take that with a grain of salt though; I haven't put it to the test.

NotKyon
  • It's for scientific simulation work--this is about 97% of my time. (For example, two lines that are a written-out matrix multiplication account for around 10% of my total run time; `floor` is another 15%.) – zebediah49 Jun 21 '12 at 20:31
  • Ah, I see. In addition to Joel's suggestion, if you have not considered doing so already, you may like to multithread this to parallelize the workflow. – NotKyon Jun 22 '12 at 09:07