I'm trying to load 4 packed floats into the xmm0 register:
float *f=(float*)_aligned_malloc(16,16);
asm volatile
(
"movaps %0,%%xmm0"
:
:"r"(f)
:"%xmm0","memory"
);
But I get this error:
operand type mismatch for `movaps'
How can I fix it?
You can just use an intrinsic, rather than trying to "re-invent the wheel":
#include <xmmintrin.h>
__m128 v = _mm_load_ps(f); // compiles to movaps
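For instance, a minimal sketch (the scale4 helper name is hypothetical) that loads, operates, and stores entirely with intrinsics, leaving register allocation to the compiler:
#include <xmmintrin.h>

// Hypothetical helper: double four packed floats in place.
// f must point to 16-byte-aligned storage, as with the _aligned_malloc above.
void scale4(float *f) {
    __m128 v = _mm_load_ps(f);   // compiles to movaps (or folds into another op)
    v = _mm_add_ps(v, v);        // v *= 2, one addps
    _mm_store_ps(f, v);          // movaps back to memory
}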
Building your own version of intrinsics out of separate single-instruction asm blocks just seems like a bad idea. If you want to write a whole block in asm, then do that, but don't reinvent intrinsics one instruction at a time: it will not perform well, and you can't force register allocation between asm blocks.
You could maybe use stuff like __m128 foo asm("xmm2"); to have the compiler keep that C variable in xmm2, but that's not guaranteed to be respected except when used as an operand for an asm statement. The optimizer will still do its job.
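For illustration, a sketch of that syntax (use_xmm2 is a made-up name; the register choice is only honored where foo is actually an asm operand):
#include <xmmintrin.h>

void use_xmm2(float *f) {
    register __m128 foo asm("xmm2") = _mm_load_ps(f); // request xmm2 for foo
    asm("addps %0, %0" : "+x"(foo));                  // binding only applies here
    _mm_store_ps(f, foo);
}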
I tried to make use of all 16 xmm registers in my program using intrinsics, but the assembly output shows that only 4 xmm registers are actually used. So I thought it would be better to implement it with inline assembly instead.
The compiler won't use extra registers for no reason, only if it needs to keep more values live at once. x86 is out-of-order with register renaming, so there are no write-after-read or write-after-write hazards; reusing the same register for something independent is not a problem. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?. A write-only access to a full register has no dependency on the old value of the register. (Merging with the old value does create a dependency, though, like movss %xmm1, %xmm0, which is why you should use movaps to copy registers even if you only care about the low element.)
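To illustrate the difference in dependencies (assuming xmm0 holds stale data):
movss  %xmm1, %xmm0    # merges into the low element: depends on the old value of xmm0
movaps %xmm1, %xmm0    # write-only to all of xmm0: breaks the dependency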
Your template will assemble to something like movaps %rax, %xmm0, which of course doesn't work: movaps needs a memory source, not an integer register.
The best way is usually to tell the compiler about the memory operand, so you don't need a "memory" clobber or a separate "dummy" operand. (A pointer operand in a register doesn't imply that the pointed-to memory needs to be in sync.)
But note that the memory operand has to have the right size, so the compiler knows you read 4 floats starting at that address. If you just used "m" (*f), it could still reorder your asm with an assignment to f[3]. (Yes, even with asm volatile, unless f[3] was also a volatile access.)
typedef float v4sf __attribute__((vector_size(16), may_alias));

// static inline
v4sf my_load_ps(float *f) {
    v4sf my_vec;
    asm("movaps %[input], %[output]"
        : [output] "=x" (my_vec)
        : [input] "m" (*(v4sf*)f)
        : // no clobbers
    );
    return my_vec;
}
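A hypothetical caller (sum_after_store is a made-up name, reusing the v4sf typedef and my_load_ps from above) showing why the sized "m" operand matters: the compiler sees that the asm reads all 16 bytes at f, so the store can't be reordered past it:
// Reuses the v4sf typedef and my_load_ps from above.
float sum_after_store(float *f) {
    f[3] = 42.0f;            // can't sink below the asm: *(v4sf*)f is an input
    v4sf v = my_load_ps(f);
    return v[3];             // GNU C vector subscripting
}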
Using a memory operand lets the compiler pick the addressing mode, so it can still unroll loops if you use this inside a loop. e.g. adding f+=16 to this function results in
movaps 64(%rdi), %xmm0
ret
instead of add $64, %rdi / movaps (%rdi), %xmm0 like you'd get if you hard-coded the addressing mode. See Looping over arrays with inline assembly.
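For example, a sketch of the displacement folding described above (load_at_64 is a hypothetical name):
v4sf load_at_64(float *f) {
    f += 16;                  // 16 floats = 64 bytes
    return my_load_ps(f);     // compiles to: movaps 64(%rdi), %xmm0 ; ret
}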
movaps into a clobbered register is completely pointless. Use an "=x" output constraint for a vector of float. If you were planning to write a separate asm() statement that assumed something would still be in xmm0, that's not safe, because the compiler could have used xmm0 to copy 16 bytes, or for scalar math. asm volatile doesn't make that safe.
Hopefully you were planning to add more instructions to the same asm statement, though.
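If you do need raw asm, a sketch of keeping the dependent instructions together in one statement (load_and_double is a made-up name), so nothing has to survive in a named register between separate asm blocks:
v4sf load_and_double(float *f) {
    v4sf v;
    asm("movaps %[mem], %[v]\n\t"
        "addps  %[v], %[v]"              // both uses of %[v] stay in one statement
        : [v] "=x" (v)
        : [mem] "m" (*(v4sf*)f));
    return v;
}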
You need to place the pointer operand inside parentheses, so it's dereferenced as a memory address instead of used as a register value:
"movaps (%0),%%xmm0"