
I'm taking my first steps with SSE2 in C++. Here's the intrinsic I'm learning right now:

__m128d _mm_add_pd (__m128d a, __m128d b)

The document says: Add packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.

But I never pass dst to that function. So how can it add the two doubles I pass (via pointer) and put them into a result array if I never pass one?

Peter Cordes
markzzz

2 Answers


The intrinsic returns the result of the computation, so you can store it in a variable or use it as another parameter.

An important thing to note here is that most SIMD instructions don't operate directly on memory: you need to explicitly load (_mm_load(u)_pd) and store (_mm_store(u)_pd) the double values, much as you would in assembly. The intermediate values will most likely live in SSE registers, or on the stack if too many registers are in use.

So if you wanted to sum up two double arrays, you would do something like

double a[N];
double b[N];
double c[N];
for (int i = 0; i < N; i += 2) {  // We load two doubles every time (assumes N is even)
    auto x = _mm_loadu_pd(a + i); // We don't know anything about alignment,
    auto y = _mm_loadu_pd(b + i); // so assume the loads are unaligned
    auto sum = _mm_add_pd(x, y);  // Compute the vector sum
    _mm_storeu_pd(c + i, sum);    // The store is unaligned as well
}
Tobias Ribizel

The description "store the results in dst" is a little misleading. The intrinsic function returns the result of the vector addition as a value of type __m128d.

__m128d arg1 = ...;
__m128d arg2 = ...;
__m128d result = _mm_add_pd(arg1, arg2);

If you call the variable dst instead of result, then you have code that fits the description. (But you can call it whatever you want.)

The underlying SSE instruction, ADDPD, stores the result in whichever XMM register is chosen as its destination operand. The compiler does that register allocation (and will even store/reload C vector variables if it runs out of registers, or around a function call that clobbers the vector registers).

Intrinsics operate on C variables, just like + and * with int or float types. Normally these compile to asm instructions that operate on registers (or maybe a memory source operand if it combines a load and add intrinsic), but leaving all this to the compiler is the point of using intrinsics.

You do want to write your code so that it can compile efficiently, though: if more than 16 __m128 variables are "alive" at once, the compiler will have to spill/reload them.

Peter Cordes
TypeIA
  • but `__m128d` isn't a sort of `typedef double` pointer? How can that code work if I don't allocate the `result` array? – markzzz Dec 12 '18 at 18:59
  • 3
    @markzzz It's not a pointer or a `typedef`, it's an opaque type which the compiler knows how to deal with. This is the abstraction I referred to. The compiler knows that the `ADDPD` output goes into `xmm1` and from there it will manage the value however it needs to based on what you've done with the variables in C. – TypeIA Dec 12 '18 at 19:04
  • So I don't understand what `__m128d` is. Compiler keyword so? GCC and MSVC do the same? C++ Standard? Any docs to it? – markzzz Dec 12 '18 at 19:07
  • Not a compiler keyword and not part of the C or C++ standards. It is an intrinsic type, defined by Intel, which is supported by [MSVC](https://learn.microsoft.com/en-us/cpp/cpp/m128d?view=vs-2017), [gcc](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html), [Intel's compiler](https://software.intel.com/en-us/intel-compilers), [clang](https://clang.llvm.org/docs/genindex.html) and likely others. – TypeIA Dec 12 '18 at 19:11
  • 1
    Not exactly related to the question, just for precision: the ADDPD instruction can store its result into any xmm register not just xmm1. The use of xmm1 in the linked docs is a little misleading. – rpress Dec 12 '18 at 19:33
  • "The __m128d data type, for use with the Streaming SIMD Extensions 2 instructions intrinsics, is defined in ." So its a data type :O Can access to its definition though :O – markzzz Dec 12 '18 at 19:40
  • Every compiler has its own version of the intrinsics headers, so their definitions may differ. If you only use the `__m*` types with the corresponding intrinsics, you can be rather sure that the compiler compiles everything as you specified it. Any access to the 'internals' of the data type will probably have to use rather inefficient instructions compared to what you can do with the `_mm*` intrinsics. So in short: Unless you know exactly what you are doing, best stick to the intrinsics. – Tobias Ribizel Dec 12 '18 at 19:54
  • 1
    Physically ``__m128d`` as a stack variable is just a proxy for a ``XMM`` hardware register. The compiler is responsible for scheduling the instructions and keeping tracking of the register file, so you don't actually know or care what register it gets mapped to. For x64 you have more ``XMM`` than x86 available. Intel and MSVC also consider ``__m128d`` to be a structure so you can use it to cast 16-byte aligned memory to the same 'type', but GCC/clang do not support this and treat ``__m128d`` as an opaque type. See also [DirectXMath](https://github.com/Microsoft/DirectXMath). – Chuck Walbourn Dec 12 '18 at 20:32
  • @rpress: fun fact: there are a few SIMD instructions that do use a fixed architectural register, like non-VEX `blendvps` / `pblendvb` implicitly uses XMM0 as the blend-control. Intel documents this with `BLENDVPS xmm1, xmm2/m128, `: http://felixcloutier.com/x86/BLENDVPS.html. (I edited the answer to remove the mistake you pointed out.) – Peter Cordes Dec 13 '18 at 04:08