
I'm trying to load 4 packed floats into the xmm0 register:

float *f=(float*)_aligned_malloc(16,16);
asm volatile
(
"movaps %0,%%xmm0"
:
:"r"(f)
:"%xmm0","memory"
);

But I get this error:

operand type mismatch for `movaps'

How can I fix it?

  • `movaps` into a clobbered register is completely pointless. Use an `"=x"` output constraint for a vector of float. If you were planning to write a separate `asm()` statement that assumed something would still be in `xmm0`, that's not safe because the compiler could have used xmm0 to copy 16 bytes, or for scalar math. – Peter Cordes Feb 23 '18 at 17:21

3 Answers


You can just use an intrinsic, rather than trying to "re-invent the wheel":

#include <xmmintrin.h>

__m128 v = _mm_load_ps(f); // compiles to movaps
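
For context, here is a minimal sketch of the load used in a complete function. The helper name `add_one_ps` is made up for illustration, and it assumes `f` points to 4 floats with 16-byte alignment, which `_mm_load_ps` and `_mm_store_ps` require.

```c
#include <xmmintrin.h>

/* Hypothetical helper (not from the question): adds 1.0f to each of the
   4 floats at f. Assumes f is 16-byte aligned. */
void add_one_ps(float *f) {
    __m128 v = _mm_load_ps(f);            /* aligned 16-byte load */
    v = _mm_add_ps(v, _mm_set1_ps(1.0f)); /* add 1.0f to all 4 lanes */
    _mm_store_ps(f, v);                   /* aligned 16-byte store */
}
```

The compiler picks the register for `v` itself, so nothing here depends on a particular xmm register.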
Paul R
  • I prefer not to use some of intrinsic functions because they don't work the way I expect. –  Feb 23 '18 at 14:30
  • Intrinsics seem to work perfectly well for most people - what specific difficulty did you encounter? – Paul R Feb 23 '18 at 14:33
  • I tried to make use of all 16 xmm registers in my program which uses intrinsics, but the assembly output of the code shows that only 4 xmm registers are actually used. So I thought it's better to implement it by inline assembly instead. –  Feb 23 '18 at 14:39
  • I think you're making some bad assumptions - the compiler will generally do a pretty good job of assigning registers (usually better than a human) as it knows what the lifetime of each register is. If you don't believe me, try writing a simple (but non-trivial) function with asm and the same function with intrinsics and then benchmark them - unless you're a very good asm programmer, the compiled intrinsics version will usually be faster. The intrinsics version will be much more robust and portable too, as it's immune to ABI issues and can easily be retargeted for a different CPU. – Paul R Feb 23 '18 at 14:42
  • That's what I'm trying to do. Just writing the asm version of my intrinsics based program. But this error won't let me do it. –  Feb 23 '18 at 14:46
  • Well good luck with that - you probably won’t achieve much but you’ll learn a lot in the process... – Paul R Feb 23 '18 at 15:15
  • If they don't work the way you expect, then your expectations are the problem. The compiler won't use extra registers for no reason; only if it needs to keep more values live at once. x86 is out-of-order with register renaming, so there are no write-after-read or write-after-write hazards; reusing the same register for something else is *not* a problem. See https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction. – Peter Cordes Feb 23 '18 at 17:15

This just seems like a bad idea. If you want to write a whole block in asm, then do that, but don't try to build your own version of intrinsics using separate single-instruction asm blocks. It will not perform well, and you can't force register allocation between asm blocks.

You could maybe use a local register variable like register __m128 foo asm("xmm2"); to have the compiler keep that C variable in xmm2, but that's not guaranteed to be respected except when the variable is used as an operand for an asm statement. The optimizer will still do its job.

I tried to make use of all 16 xmm registers in my program which uses intrinsics, but the assembly output of the code shows that only 4 xmm registers are actually used. So I thought it's better to implement it by inline assembly instead.

The compiler won't use extra registers for no reason; only if it needs to keep more values live at once. x86 is out-of-order with register renaming, so there are no write-after-read or write-after-write hazards; reusing the same register for something independent is not a problem. See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?. A write-only access to a full register has no dependency on the old value of the register. (Merging with the old value does create a dependency though, like movss %xmm1, %xmm0, which is why you should use movaps to copy registers even if you only care about the low element.)
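
To see the "more live values" effect with intrinsics, here is a sketch that keeps four independent accumulators alive across a loop. The function `sum_ps` is made up for illustration; it assumes `a` is 16-byte aligned and `n` is a multiple of 16. Which registers actually get used depends on the compiler and its options.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical example: sums n floats using 4 independent accumulators.
   All four are live for the whole loop, so the compiler needs a separate
   register for each of them. */
float sum_ps(const float *a, size_t n) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm_add_ps(acc0, _mm_load_ps(a + i));
        acc1 = _mm_add_ps(acc1, _mm_load_ps(a + i + 4));
        acc2 = _mm_add_ps(acc2, _mm_load_ps(a + i + 8));
        acc3 = _mm_add_ps(acc3, _mm_load_ps(a + i + 12));
    }
    /* Combine the accumulators, then add up the 4 lanes. */
    __m128 acc = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

With a single accumulator the same loop would need fewer registers, and the compiler would use fewer.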


Your template will assemble to something like movaps %rax, %xmm0, which of course doesn't work. movaps needs an XMM register or a memory operand as its source, not an integer register.

The best way is usually to tell the compiler about the memory operand, so you don't need a "memory" clobber or a separate "dummy" operand. (A pointer operand in a register doesn't imply that the pointed-to memory needs to be in sync).

But note that the memory operand has to have the right size, so the compiler knows you read 4 floats starting at that address. If you just used "m" (*f), it could still reorder your asm with an assignment to f[3]. (Yes, even with asm volatile, unless f[3] was also a volatile access.)

typedef float v4sf __attribute__((vector_size(16),may_alias));

// static inline    
v4sf my_load_ps(float *f) {
    v4sf my_vec;
    asm(
        "movaps %[input], %[output]"
        : [output] "=x" (my_vec)
        : [input] "m" (*(v4sf*)f)
        : // no clobbers
    );
    return my_vec;
}

(On Godbolt)

Using a memory operand lets the compiler pick the addressing mode, so it can still unroll loops if you use this inside a loop. e.g. adding f+=16 to this function results in

    movaps 64(%rdi), %xmm0
    ret

instead of add $64, %rdi / movaps (%rdi), %xmm0 like you'd get if you hard-coded the addressing mode. See Looping over arrays with inline assembly.


movaps into a clobbered register is completely pointless. Use an "=x" output constraint for a vector of float. If you were planning to write a separate asm() statement that assumed something would still be in xmm0, that's not safe because the compiler could have used xmm0 to copy 16 bytes, or for scalar math. asm volatile doesn't make that safe.

Hopefully you were planning to add more instructions to the same asm statement, though.

Peter Cordes

You need to place the pointer operand inside parentheses:

"movaps (%0),%%xmm0"