operand type mismatch for `lddqu' with an __int128 "=r" destination

Question

I need to move a 128-bit value from the address [rsi - 0x80] into the dst variable below using the lddqu instruction, and I am encountering the error "operand type mismatch for `lddqu'". I know there are previous questions on Stack Overflow about smaller operand sizes, but what suffix should I use with the instruction to get the value at that address into the variable?

__int128 dst = 0, src = 0;
asm volatile ("lddqu -0x80(%%rsi), %0\n\t"
        : "=r" (dst)
        : "r" (src));

Just to give an overview of the entire problem: this is only one instruction in a larger graph algorithm that finds the shortest path between two vertices. The src variable is redundant and can be removed if it adds ambiguity. I am designing a hardware prefetcher (in a processor simulator) that predicts future memory addresses based on the currently accessed addresses. Once I can get an address in a variable like dst, I have a technique that automatically predicts the future address and triggers the memory request for that address.

A larger version of the pattern is a sequence of loads and stores, and looks like this:

  lddqu  xmm0,[rsi-0x80]
  movdqu XMMWORD PTR [rdi-0x80],xmm0
  lddqu  xmm0,[rsi-0x70]
  movdqu XMMWORD PTR [rdi-0x70],xmm0
  lddqu  xmm0,[rsi-0x60]
  movdqu XMMWORD PTR [rdi-0x60],xmm0
  lddqu  xmm0,[rsi-0x50]
  movdqu XMMWORD PTR [rdi-0x50],xmm0

Now, I am working on how to get the inline asm working with Intel syntax.

You should be using intrinsics for this, especially if you don't already know how to use GNU C inline asm. Like `__m128i tmp = _mm_loadu_si128( (const __m128i*)&src );` then you can `_mm_storeu_si128` into dst if you want. https://stackoverflow.com/tags/sse/info. Or obviously just `dst = src` and let the compiler do its job. — Peter Cordes, Dec 30 '21 at 19:28
re: your edit: compilers generally don't need help using XMM registers to copy 16-byte objects around. Or even using YMM registers to copy them in pairs. If you're worried about `lddqu` for more efficient unaligned loads, that only ever mattered on P4 ([A faster integer SSE unalligned load that's rarely used](https://stackoverflow.com/q/38370622)). On all other x86 CPUs, `lddqu` and `movdqu` decode to the same internal uops. Seems like you really just want `memcpy` or `memmove` if you have some contiguous copying to do. — Peter Cordes, Dec 31 '21 at 09:47
Re: Intel syntax in inline asm: [How to set gcc to use intel syntax permanently?](https://stackoverflow.com/q/38953951) — Peter Cordes, Dec 31 '21 at 09:47

score 3 · Accepted Answer · edited Dec 30 '21 at 19:35

lddqu can only load into a vector register, not a general-purpose register. Use =x in place of =r for dst's constraint.

Also, your source looks suspicious: you're ignoring src and loading from a fixed offset from rsi, a register whose contents you know nothing about at that point, because the compiler is free to keep anything there.

Look at the compiler-generated asm around your asm statement to see how the compiler gets __int128 dst back into memory or integer registers after you force it to be in an XMM register, for example on https://godbolt.org/, especially with -O2 optimization enabled.

Using inline asm like this is probably even worse for efficiency than using SSE intrinsics like _mm_loadu_si128. See also https://gcc.gnu.org/wiki/DontUseInlineAsm

Using =x solved the issue. I have added a broader context if that can be helpful for any future reader. — user7807498, Dec 31 '21 at 09:43
