The description "store the results in dst" is a little misleading. The intrinsic function returns the result of the vector addition as a value of type __m128d.
__m128d arg1 = ...;
__m128d arg2 = ...;
__m128d result = _mm_add_pd(arg1, arg2);
If you call the variable dst instead of result, then you have code that fits the description. (But you can call it whatever you want.)
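For a version you can compile, here is a minimal sketch. The helper name add_two_pairs and its array parameters are made up for illustration; only the intrinsics themselves come from <emmintrin.h>.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */

/* Hypothetical helper: adds two pairs of doubles, writing both sums to out. */
static void add_two_pairs(const double a[2], const double b[2], double out[2])
{
    __m128d arg1 = _mm_loadu_pd(a);         /* unaligned load of a[0], a[1] */
    __m128d arg2 = _mm_loadu_pd(b);         /* unaligned load of b[0], b[1] */
    __m128d dst  = _mm_add_pd(arg1, arg2);  /* "dst" is just the return value */
    _mm_storeu_pd(out, dst);                /* copy both lanes back to memory */
}
```

Here dst is an ordinary local variable; where it actually lives (a register, or briefly the stack) is up to the compiler.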
The underlying SSE2 instruction, ADDPD, writes its result to the XMM register given as its destination operand, and it is the compiler that chooses which register that is. The compiler does register allocation (and will even store/reload C vector variables if it runs out of registers, or around a function call that clobbers the vector registers).
Intrinsics operate on C variables, just like + and * with int or float types. Normally these compile to asm instructions that operate on registers (or maybe a memory source operand if it combines a load and add intrinsic), but leaving all this to the compiler is the point of using intrinsics.
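As a rough illustration of treating intrinsics like ordinary operators on C variables, here is a sketch of an element-wise array add. The function name add_arrays and its signature are assumptions for this example, not anything from the intrinsics guide.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>     /* size_t */

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
static void add_arrays(double *dst, const double *a, const double *b, size_t n)
{
    size_t i = 0;

    /* Two doubles per iteration; no register names appear anywhere. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(dst + i, _mm_add_pd(va, vb));
    }

    /* Scalar cleanup for an odd trailing element. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Depending on the target options and what the compiler knows about alignment, it may fold one of the loads into the add's memory source operand, or it may emit separate load instructions. Either way the source code stays the same.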
You do want to write your code so that it can compile efficiently, though: x86-64 has only 16 XMM registers without AVX-512 (and 32-bit mode has only 8), so if more than 16 __m128 variables are live at once, the compiler will have to spill/reload some of them.
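To make the register-pressure point concrete, here is a sketch of a sum reduction that keeps only a few vector variables live at a time. The function name sum_doubles and the choice of two accumulators are assumptions for illustration; the right number of accumulators depends on the target CPU.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>     /* size_t */

/* Hypothetical helper: returns p[0] + ... + p[n-1]. */
static double sum_doubles(const double *p, size_t n)
{
    /* Only two accumulators plus short-lived temporaries are live in the
       loop, far fewer than the available XMM registers, so nothing spills. */
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_loadu_pd(p + i));
        acc1 = _mm_add_pd(acc1, _mm_loadu_pd(p + i + 2));
    }

    /* Combine the accumulators, then add the two lanes together. */
    __m128d acc = _mm_add_pd(acc0, acc1);
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double total = lanes[0] + lanes[1];

    /* Scalar cleanup for the last 0 to 3 elements. */
    for (; i < n; i++)
        total += p[i];
    return total;
}
```

Unrolling this by hand to more than 16 accumulators would leave the compiler no choice but to keep some of them on the stack. Note that the vectorized order of additions can round differently from a plain left-to-right scalar sum.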