18

I'm currently developing an open source 3D application framework in C++. My own math library is designed like the XNA math library, also with SIMD in mind. But currently it is not really fast, and it has problems with memory alignment, but more about that in a different question.

Some days ago I asked myself why I should write my own SSE code. The compiler is also able to generate highly optimized code when optimization is turned on. I can also use the "vector extension" of GCC. But none of this is really portable.
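
For example, with GCC's vector extension an element-wise multiplication could look something like this (and this syntax is compiler-specific, which is exactly the portability problem):

    typedef float v4sf __attribute__ ((vector_size (16)));
    
    v4sf v1 = {0.5f, 2.0f, 4.0f, 0.25f};
    v4sf v2 = {2.0f, 0.5f, 0.25f, 4.0f};
    v4sf res = v1 * v2; // element-wise multiply, becomes a mulps on SSE targets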

I know that I have more control when I use my own SSE code, but often this control is unnecessary.

One big problem of SSE is the use of dynamic memory, which I limit as much as possible with the help of memory pools and data-oriented design.

Now to my question:

  • Should I use naked SSE? Perhaps encapsulated.

    __m128 v1 = _mm_set_ps(0.5f, 2, 4, 0.25f);
    __m128 v2 = _mm_set_ps(2, 0.5f, 0.25f, 4);
    
    __m128 res = _mm_mul_ps(v1, v2);
    
  • Or should the compiler do the dirty work?

    float v1[4] = {0.5f, 2, 4, 0.25f};
    float v2[4] = {2, 0.5f, 0.25f, 4};
    
    float res[4];
    res[0] = v1[0]*v2[0];
    res[1] = v1[1]*v2[1];
    res[2] = v1[2]*v2[2];
    res[3] = v1[3]*v2[3];
    
  • Or should I use SIMD with additional code? Like a dynamic container class with SIMD operations, which needs additional load and store instructions.

    Pear3D::Vector4f* v1 = new Pear3D::Vector4f(0.5f, 2, 4, 0.25f);
    Pear3D::Vector4f* v2 = new Pear3D::Vector4f(2, 0.5f, 0.25f, 4);
    
    Pear3D::Vector4f res = Pear3D::Vector::multiplyElements(*v1, *v2);
    

    The above example uses an imaginary class that uses a float[4] internally and uses store and load in each method like multiplyElements(...). The methods use SSE internally.

I don't want to use another library, because I want to learn more about SIMD and large scale software design. But library examples are welcome.

PS: This is not so much a real problem as a design question.

pearcoding
  • 1,149
  • 1
  • 9
  • 28
  • 1
    Why not be lazy and let the compiler do the optimizations, if it is possible? – Vlad May 23 '12 at 11:14
  • This is the question: why should I not let the compiler do the dirty work? I have read many C++ and design books and most of them prefer an SSE implementation. – pearcoding May 23 '12 at 11:16
  • 12
    Well, definitely not the 3rd one, at least not with dynamically allocated memory for something as small as a vec4 (this is C++ and not Java). You may encapsulate the `__m128` into a class to propagate its alignment restrictions (of course you have to take care of dynamic allocation by overloading `operator new` and specializing `std::allocator`), but don't ever use dynamic memory allocation for something as simple as a single vec4. This will outweigh any possible gain from SSE by a factor of two billion (exaggeration intended). – Christian Rau May 23 '12 at 11:23
  • Not sure how it looks right now, but two years ago using intrinsics was the way to do it. Maybe have a look at OpenCL or CUDA. At least theoretically, with OpenCL you should be able to generate fast GPU and CPU code which uses vector extensions. – Nils May 23 '12 at 11:23
  • I currently use the first one with encapsulation, and I also want to use OpenCL, but only for big work, like Deferred Rendering, not for little things like vector multiplication :) – pearcoding May 23 '12 at 11:25
  • I see, but there is probably already a math vector lib with vector-extension support, just as there is https://github.com/ridiculousfish/libdivide for int division – Nils May 23 '12 at 11:31
  • 4
    @Vlad The compiler doesn't always do it properly. That is when you need to use sseX functions – BЈовић May 23 '12 at 11:34
  • I know that there are already libraries like [AMD's SSEPlus](http://developer.amd.com/LIBRARIES/Pages/default.aspx) or [eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page) but I want to make my own one... – pearcoding May 23 '12 at 11:35
  • Why on earth would you ever use `new` and dynamic allocation? – Puppy May 23 '12 at 14:24
  • The code is only an example... Who on earth would use dynamic memory for vectors which are multiplied together in the next step :D Also I could write Pear3D::Vector4f(1.0f, 1.0f, 1.0f, 1.0f)... but the example should only show the techniques used :D – pearcoding May 23 '12 at 15:41
  • 3
    my experience is: the compilers cannot be relied upon, in particular they cannot change your data layout (with its immediate implications on its usability with SSE instructions), so you must design your application carefully and avoid un-aligned data. I prefer method 1, but possibly encapsulated properly. – Walter May 23 '12 at 15:51
  • 4
    I will go further to say that current compilers are almost useless for auto-vectorization. They fail to vectorize many things. And whatever they *can* vectorize tend to be very simple loops which are likely memory bound. Part of the problem is that they don't have the "big picture" information to do the necessary transformations for vectorization. – Mysticial May 23 '12 at 18:21
  • I would say try both and just see what works best for you – pyCthon Jun 01 '12 at 21:44
  • Current compilers are poor at automatically vectorizing, but on the other hand, SSE is rarely worth it, too. SSE is extremely poor (and extremely mis-designed) for most everything except crunching huge homogenous SoA datasets (so unless you write something like a video codec, forget it). For the typical stuff like calculating a dot product on AoS data like you'll have it in your 3D Framework, it buys you close to nothing. – Damon Jun 01 '12 at 22:49

3 Answers

15

Well, if you want to use SIMD extensions, a good approach is to use SSE intrinsics (and of course, by all means stay away from inline assembly, but fortunately you didn't list that as an alternative anyway). But for cleanliness you should encapsulate them in a nice vector class with overloaded operators:

#include <xmmintrin.h> //SSE intrinsics (__m128, _mm_mul_ps, ...)

struct aligned_storage
{
    //overload operator new and operator delete for 16-byte alignment
};

class vec4 : public aligned_storage
{
public:
    vec4(float x, float y, float z, float w)
    {
         //don't use _mm_set_ps here, it will do the same, followed by a _mm_load_ps, which is unnecessary
         data_[0] = x; data_[1] = y; data_[2] = z; data_[3] = w;
    }
    vec4(const float *data)
    {
         //don't use _mm_loadu_ps, unaligned just doesn't pay
         data_[0] = data[0]; data_[1] = data[1]; data_[2] = data[2]; data_[3] = data[3];
    }
    vec4(const vec4 &rhs)
        : xmm_(rhs.xmm_)
    {
    }
    ...
    vec4& operator*=(const vec4 v)
    {
         xmm_ = _mm_mul_ps(xmm_, v.xmm_);
         return *this;
    }
    ...

private:
    union
    {
        __m128 xmm_;
        float data_[4];
    };
};

Now the nice thing is that, due to the anonymous union (UB, I know, but show me a platform with SSE where this doesn't work), you can use the standard float array whenever necessary (like in operator[] or for initialization (don't use _mm_set_ps)) and only use SSE when appropriate. With a modern inlining compiler the encapsulation probably comes at no cost (I was rather surprised how well VC10 optimized the SSE instructions for a bunch of computations with this vector class, with no fear of unnecessary moves into temporary memory variables, which VC8 seemed to like even without encapsulation).

The only disadvantage is, that you need to take care of proper alignment, as unaligned vectors don't buy you anything and may even be slower than non-SSE. But fortunately the alignment requirement of the __m128 will propagate into the vec4 (and any surrounding class) and you just need to take care of dynamic allocation, which C++ has good means for. You just need to make a base class whose operator new and operator delete functions (in all flavours of course) are overloaded properly and from which your vector class will derive. To use your type with standard containers you of course also need to specialize std::allocator (and maybe std::get_temporary_buffer and std::return_temporary_buffer for the sake of completeness), as it will use the global operator new otherwise.
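
For illustration, here is a minimal sketch of what such an aligned_storage base class could look like, using _mm_malloc/_mm_free (which MSVC and GCC both ship alongside their intrinsics headers); treat it as a sketch rather than production code:

#include <xmmintrin.h>
#include <malloc.h> //MSVC declares _mm_malloc/_mm_free here; GCC pulls them in via <xmmintrin.h>
#include <new>
#include <cstddef>

struct aligned_storage
{
    //route all dynamic allocations of derived classes through 16-byte aligned memory
    //(the nothrow and placement flavours are omitted for brevity)
    static void* operator new(std::size_t size)
    {
        void *p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void* operator new[](std::size_t size)
    {
        void *p = _mm_malloc(size, 16);
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void *p)   { _mm_free(p); }
    static void operator delete[](void *p) { _mm_free(p); }
};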

But the real disadvantage is that you also need to take care of the dynamic allocation of any class that has your SSE vector as a member, which may be tedious, but can again be automated a bit by also deriving those classes from aligned_storage and putting the whole std::allocator specialization mess into a handy macro.

JamesWynn has a point that those operations often come together in some special heavy computation blocks (like texture filtering or vertex transformation), but on the other hand using those SSE vector encapsulations doesn't introduce any overhead over a standard float[4] implementation of a vector class. You need to get those values from memory into registers anyway (be it the x87 stack or a scalar SSE register) in order to do any computations, so why not take them all at once (which IMHO should not be any slower than moving a single value if properly aligned) and compute in parallel. Thus you can freely switch out an SSE implementation for a non-SSE one without inducing any overhead (correct me if my reasoning is wrong).

But if ensuring alignment for all classes having vec4 as a member is too tedious for you (which is IMHO the only disadvantage of this approach), you can also define a specialized SSE vector type which you use for computations and use a standard non-SSE vector for storage.


EDIT: Ok, to address the overhead argument that goes around here (and looks quite reasonable at first), let's take a bunch of computations which look very clean thanks to the overloaded operators:

#include "vec.h"
#include <iostream>

int main(int argc, char *argv[])
{
    math::vec<float,4> u, v, w = u + v;
    u = v + dot(v, w) * w;
    v = abs(u-w);
    u = 3.0f * w + v;
    w = -w * (u+v);
    v = min(u, w) + length(u) * w;
    std::cout << v << std::endl;
    return 0;
}

and see what VC10 thinks about it:

...
; 6   :     math::vec<float,4> u, v, w = u + v;

movaps  xmm4, XMMWORD PTR _v$[esp+32]

; 7   :     u = v + dot(v, w) * w;
; 8   :     v = abs(u-w);

movaps  xmm3, XMMWORD PTR __xmm@0
movaps  xmm1, xmm4
addps   xmm1, XMMWORD PTR _u$[esp+32]
movaps  xmm0, xmm4
mulps   xmm0, xmm1
haddps  xmm0, xmm0
haddps  xmm0, xmm0
shufps  xmm0, xmm0, 0
mulps   xmm0, xmm1
addps   xmm0, xmm4
subps   xmm0, xmm1
movaps  xmm2, xmm3

; 9   :     u = 3.0f * w + v;
; 10   :    w = -w * (u+v);

xorps   xmm3, xmm1
andnps  xmm2, xmm0
movaps  xmm0, XMMWORD PTR __xmm@1
mulps   xmm0, xmm1
addps   xmm0, xmm2

; 11   :    v = min(u, w) + length(u) * w;

movaps  xmm1, xmm0
mulps   xmm1, xmm0
haddps  xmm1, xmm1
haddps  xmm1, xmm1
sqrtss  xmm1, xmm1
addps   xmm2, xmm0
mulps   xmm3, xmm2
shufps  xmm1, xmm1, 0

; 12   :    std::cout << v << std::endl;

mov edi, DWORD PTR __imp_?cout@std@@3V?$basic_ostream@DU?$char_traits@D@std@@@1@A
mulps   xmm1, xmm3
minps   xmm0, xmm3
addps   xmm1, xmm0
movaps  XMMWORD PTR _v$[esp+32], xmm1
...

Even without thoroughly analyzing every single instruction and its use, I'm pretty confident to say that there aren't any unnecessary loads or stores, except for the ones at the beginning (Ok, I left the vectors uninitialized), which are necessary anyway to get them from memory into computing registers, and at the end, which is necessary because in the following expression v is printed out. It didn't even store anything back into u and w, since they are only temporary variables which I don't use any further. Everything is perfectly inlined and optimized out. It even managed to seamlessly shuffle the result of the dot product for the following multiplication, without it ever leaving the XMM register, although the dot function returns a float using an actual _mm_store_ss after the haddps instructions.

So even I, being usually a bit oversuspicious of the compiler's abilities, have to say that handcrafting your own intrinsics into special functions doesn't really pay compared to the clean and expressive code you gain by encapsulation. Though you may be able to create killer examples where handcrafting the intrinsics may indeed save you a few instructions, but then again you first have to outsmart the optimizer.


EDIT: Ok, Ben Voigt pointed out another problem of the union besides the (most probably not problematic) memory layout incompatibility, which is that it violates strict aliasing rules and the compiler may optimize instructions accessing different union members in a way that makes the code invalid. I hadn't thought about that yet. I don't know if it causes any problems in practice; it certainly needs investigation.

If it really is a problem, we unfortunately need to drop the data_[4] member and use the __m128 alone. For initialization we then have to resort to _mm_set_ps and _mm_loadu_ps again. The operator[] gets a bit more complicated and might need some combination of _mm_shuffle_ps and _mm_store_ss, and for the non-const version you have to use some kind of proxy object delegating an assignment to the corresponding SSE instructions. It has to be investigated in which way the compiler can optimize this additional overhead in the specific situations.
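
To illustrate, a minimal sketch of how the alias-free component access could be done (made-up helper names; instead of the shuffle/store_ss combination this simply spills the register into a small local array, and the write proxy would just delegate to something like set_component):

#include <xmmintrin.h>
#include <cstddef>

//alias-free element read: dump the register into a plain local array first
inline float get_component(__m128 v, std::size_t i)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v); //unaligned store, so the local needs no special alignment
    return tmp[i];
}

//alias-free element write: rebuild the whole register from a modified local copy
inline __m128 set_component(__m128 v, std::size_t i, float value)
{
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    tmp[i] = value;
    return _mm_loadu_ps(tmp);
}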

Or you only use the SSE vector for computations and make an interface for converting whole vectors to and from non-SSE vectors, which is then used at the peripherals of computations (as you often don't need to access individual components inside lengthy computations). This seems to be the way glm handles this issue. But I'm not sure how Eigen handles it.

But however you tackle it, there is still no need to handcraft SSE intrinsics without using the benefits of operator overloading.

Christian Rau
  • 45,360
  • 10
  • 108
  • 185
  • +1 thanks for the answer. I already use intrinsics but don't use a real encapsulation class with overloaded operators. In one book (I don't really remember which, maybe [Real Time Rendering](http://www.realtimerendering.com/) by Tomas Akenine-Möller, Eric Haines, and Naty Hoffman) this was not really suggested because of the overhead. I think I should redesign the application and concentrate many math-heavy algorithms in one place :) – pearcoding Jun 01 '12 at 13:40
  • 3
    @omercan1993 Check it out. Make some of those vectors and do a bunch of computations with them. I would be surprised if it didn't result in just a bunch of SSE instructions without any unnecessary loads or stores (leaving aside function calls). I don't think a recent enough gcc is in any way worse than VC10 in this regard. Of course at the start of those computation blocks you probably have some loads, but those are there for non-SSE vectors or hand-written intrinsics anyway. – Christian Rau Jun 01 '12 at 13:43
  • I will try it and post the result here :) – pearcoding Jun 01 '12 at 13:43
  • You're violating strict aliasing here... can you provide any documentation that allows this particular case? The problem isn't that the memory layout may be incompatible (although it could be on some unusual platform), it's that the compiler is free to optimize each field of the union separately and perform reordering that breaks your code. – Ben Voigt Jun 01 '12 at 19:04
4

I suggest that you learn about expression templates (custom operator implementations that use proxy objects). In this way, you can avoid doing performance-killing load/store around each individual operation, and do them only once for the entire computation.
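
To give a rough idea, here is a deliberately simplified, non-SSE sketch of the technique (hypothetical names, only addition, and a real implementation would constrain the operator template, e.g. via a common CRTP base class). The point is that operator+ only builds a lightweight proxy; the actual work happens once, in the assignment:

#include <cstddef>
#include <iostream>

namespace et {

//expression node: holds references to its operands and evaluates lazily
template<class L, class R>
struct AddExpr
{
    AddExpr(const L &l, const R &r) : l_(l), r_(r) {}
    float operator[](std::size_t i) const { return l_[i] + r_[i]; }
    const L &l_;
    const R &r_;
};

//plain storage vector; assigning any expression evaluates the whole tree
//in one pass, so intermediate results never have to be stored
struct Vec4
{
    float v[4];
    float operator[](std::size_t i) const { return v[i]; }

    template<class E>
    Vec4& operator=(const E &e)
    {
        for (std::size_t i = 0; i < 4; ++i)
            v[i] = e[i];
        return *this;
    }
};

//operator+ just records the operation instead of computing it immediately
//(deliberately unconstrained here for brevity)
template<class L, class R>
AddExpr<L, R> operator+(const L &l, const R &r)
{
    return AddExpr<L, R>(l, r);
}

} //namespace et

int main()
{
    et::Vec4 a = {{1.0f, 2.0f, 3.0f, 4.0f}}, b = {{5.0f, 6.0f, 7.0f, 8.0f}};
    et::Vec4 c = {{9.0f, 10.0f, 11.0f, 12.0f}}, r = {{0.0f, 0.0f, 0.0f, 0.0f}};
    r = a + b + c; //builds AddExpr<AddExpr<Vec4,Vec4>,Vec4>, evaluated once in operator=
    std::cout << r[0] << ' ' << r[3] << std::endl; //prints 15 24
    return 0;
}

In an SSE version the per-component operator[] would be replaced by an evaluation that loads, combines and stores whole __m128 registers, so each vector is touched in memory only once per expression.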

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • +1 Something like [this](http://stackoverflow.com/questions/4527394/template-trick-to-optimize-out-allocations)? Thanks I will look at them. – pearcoding Jun 01 '12 at 13:42
  • 1
    @omercan1993: Yes, that's exactly what I'm talking about. – Ben Voigt Jun 01 '12 at 13:44
  • 1
    Well, the compiler (my VC10 at least) is actually quite good at inlining and optimizing those computations down to the bare SSE instructions without incurring unnecessary loads and stores (see my answer for a small (and maybe simple, but IMHO rather common) example). Though it may still be possible to create some killer example computations (for which you need a very good ET implementation, too). – Christian Rau Jun 01 '12 at 16:43
  • @ChristianRau: I don't know whether your code works in practice or not, but you're well into the land of undefined behavior by using a union to setup your MMX variables instead of the load and store intrinsics. – Ben Voigt Jun 01 '12 at 19:03
  • @BenVoigt Yes, I know it's UB, but c'mon, we're deep in the bowels of hardware with SSE. Just show me an SSE-enabled platform where this union trick doesn't work. Of course you're on the safe side of platform independence when completely dropping SSE anyway, but well. – Christian Rau Jun 01 '12 at 20:07
  • @Christian: Say you read `data_[0]` and store it into an integer. Then do some MMX operations, and convert `data_[0]` to an integer again. Because you haven't written to any `float` variable between the two reads, the optimizer is perfectly Standard-compliant if it reuses the result of the original conversion, instead of generating a second access to `data_[0]` and converting the new value. – Ben Voigt Jun 01 '12 at 21:04
  • @BenVoigt Ok, I hadn't thought about aliasing yet, I have to admit. Needs further investigation. But ok, the `data_` alias is usually only used at the peripherals anyway and could be removed. For setting one may need to resort to `_mm_set_ps`, though getting is more problematic, probably a `_mm_shuffle_ps` followed by an `_mm_store_ss`. But the fast computation stays. – Christian Rau Jun 01 '12 at 21:25
2

I would suggest using the naked SIMD code in a tightly controlled function. Since you won't be using it for your primary vector multiplication because of the overhead, this function should probably take the list of Vector3 objects that need to be manipulated, as per DOD. Where there's one, there are many.
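
As a rough illustration (made-up function name, using packs of four floats rather than Vector3 for simplicity, and assuming all pointers are 16-byte aligned), such a batch function could look like this:

#include <xmmintrin.h>

//component-wise multiply of two arrays of 4-float vectors; count is the number of vectors
void multiply_elements_batch(const float *a, const float *b, float *out, int count)
{
    for (int i = 0; i < count; ++i)
    {
        __m128 va = _mm_load_ps(a + 4 * i);
        __m128 vb = _mm_load_ps(b + 4 * i);
        _mm_store_ps(out + 4 * i, _mm_mul_ps(va, vb));
    }
}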

James Wynn
  • 686
  • 6
  • 6
  • +1 So I can use a function like pear3d_multiply3fv(float* vs, int count) :) But it is very hard to concentrate many multiplications in one place... well, I should rethink the design, but with DOD it shouldn't be very hard. – pearcoding May 23 '12 at 11:44
  • @JamesWynn There isn't really any overhead when doing all computations with the SSE-implemented vector (see my answer), so I don't think moving everything into a separate function with handcrafted intrinsics really pays. But you're right in that the major use of SSE code is indeed in some special computation-heavy blocks. – Christian Rau Jun 01 '12 at 17:02