
as in the title - I want to do as below:

__m128i_u* avxVar = (__m128i_u*)Var;  // Var allocated with alloc
*avxVar = _mm_set_epi64(...);         // is that ok to assign __m128i to __m128i_u ?
Peter Cordes
vela18

1 Answer


Yes, but note that __m128i_u is not portable (e.g. to MSVC); it's what GCC/clang use internally to implement unaligned loadu/storeu intrinsics. It's exactly equivalent to do it the normal way:

_mm_storeu_si128((__m128i*)Var, vec);

(where vec is any __m128i; e.g. it could be the result of _mm_set_epi64x, or a variable.)

GCC 11's emmintrin.h defines _mm_storeu_si128 like this, taking a __m128i_u* pointer arg, so the dereference does an unaligned access (if not optimized away):

// GCC internals, quoted for reference only.
// Just use #include <immintrin.h> in your own code
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_storeu_si128 (__m128i_u *__P, __m128i __B)
{
  *__P = __B;
}

So yes, GCC's headers depend on __m128i* and __m128i_u* being compatible and implicitly convertible.

So just as _mm_storeu_si128 is an intrinsic for movdqu, a __m128i_u* dereference is too. But really these intrinsics just exist to communicate alignment information to the compiler; it's up to the compiler to decide when to actually load and store, just like with a deref of char*.

(Fun fact: __m128i* is a may_alias type, like char*, so you can point it at anything without violating strict-aliasing. See: Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?)


Also note that _mm_set_epi64 takes __m64 args: it was for building an SSE2 vector from two MMX vectors, not from scalar int64_t. You probably want _mm_set_epi64x.


These two ways (a __m128i_u* dereference vs. _mm_storeu_si128) compile identically:

void foo(void *Var) {
    __m128i_u* avxVar = (__m128i_u*)Var;
    *avxVar = _mm_set_epi64x(1, 2); 
}

void bar(void *Var) {
    _mm_storeu_si128((__m128i*)Var, _mm_set_epi64x(1, 2) );
}

Both functions compile identically with GCC and clang (and are semantically equivalent, so they will always be the same after inlining). But only the 2nd one compiles at all with MSVC, as you can see on the Godbolt compiler explorer: https://godbolt.org/z/Y8Wq96Pqs . If you disable the #ifdef __GNUC__, you get compiler errors on MSVC.

## GCC -O3
foo:
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        movups  XMMWORD PTR [rdi], xmm0
        ret
bar:
        movdqa  xmm0, XMMWORD PTR .LC0[rip]
        movups  XMMWORD PTR [rdi], xmm0
        ret
.LC0:
        .quad   2
        .quad   1

With more complex surrounding code, _mm_loadu_si128 can fold into a memory source operand for an ALU instruction only with AVX (e.g. vpaddb xmm0, xmm1, [rdi]), but _mm_load_si128 aligned loads can fold into SSE memory source operands like paddb xmm0, [rdi].

– Peter Cordes
  • thx. so should I expect that the latency and throughput of these two methods are the same? I can not find any info about __m128i_u, that's why I'm asking. – vela18 Aug 26 '21 at 13:56
  • also I can use my method with 256-bit registers even on CPU which does not support avx2. But I can not use there _mm256_storeu_si256. Am i right? – vela18 Aug 26 '21 at 13:59
  • @vela18: Nothing documents it because it's not intended for use, except in GCC/clang's implementation of `_mm_loadu_si128`. (It works; it's just a mechanism for getting C compilers to emit unaligned loads / stores, see updated answer. I had some of that update already written before you commented.) – Peter Cordes Aug 26 '21 at 14:12
  • @vela18: `vmovdqa` / [`vmovdqu`](https://www.felixcloutier.com/x86/movdqu:vmovdqu8:vmovdqu16:vmovdqu32:vmovdqu64) somewhat surprisingly only require AVX1, so yes you can use `_mm256_storeu_si256` if you have AVX1 without AVX2 (e.g. Sandybridge or Bulldozer). More normally you'd just use `_mm256_storeu_ps` with AVX1 if you don't have AVX2. But note that all existing CPUs without AVX2 will actually do 256-bit loads/stores as two 16-byte halves, with SnB having extra penalties for misalignment. – Peter Cordes Aug 26 '21 at 14:15
  • So if you're writing a version of a function intended only for CPUs without AVX2, copying unaligned memory around using 16-byte integer vectors is probably better. If the code will run on anything with AVX, including modern CPUs that have AVX2, then yeah probably just use `__m256` or `__m256i`. And when you compile, be aware of tuning options `-mavx256-split-unaligned-load` which is unfortunately on by default: [Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726) – Peter Cordes Aug 26 '21 at 14:18