Similar to this question, I'd like to gather several 24-bit values into 32-bit dwords of an SSE/AVX register. Further:
- each value is at a non-contiguous offset from a base pointer
- each value's offset has only 1-byte alignment
- I can ensure reading a vector beyond (or before) each value is safe
An AVX2 (performant?) gather solution is OK, but I also need pre-AVX support. It looks like pinsrd with a SIB byte encoding a scale factor of 1 (byte-granular indexing) does exactly what I want, but I can't figure out how to get the compiler to emit this instruction encoding...
Using the standard intrinsic:
uint32_t *p = &base[offset];
vec = _mm_insert_epi32(vec, *p, 1); // for each dword...
Yields a reasonable encoding, with the index scaled by 4 (i.e. assuming a dword-aligned offset):
660f3a2244_b5_0001 pinsrd $0x1, (%rbp,%rsi,4), %xmm0
But I'd actually like to emit:
660f3a2244_35_0001 pinsrd $0x1, (%rbp,%rsi), %xmm0
and manually pre-multiply offset by 3.
This encoding (tested via hex editing a linked binary) appears to work just fine. But... how can I emit it? No amount of type casting or attribute __align__ seems to work. The obvious approach:
uint8_t *p = &base[offset*3];
vec = _mm_insert_epi32(vec, *p, 1);
of course dereferences one byte with zero-extension to a dword before inserting.
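For comparison, the unaligned 4-byte read itself can be written in portable C with memcpy, which compilers commonly fold into a single load. The sketch below is only an illustration under stated assumptions: load_u32_unaligned and gather4_u24 are hypothetical names, it assumes little-endian x86 with GCC or Clang (for the target attribute), values packed 3 bytes apart, and that reading one byte past each value is safe as stated above. Whether the load gets fused into pinsrd's memory operand is up to the compiler.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

/* Unaligned 32-bit load. memcpy sidesteps alignment and strict-aliasing
 * problems; optimizing compilers typically lower it to one mov. */
static inline uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: gather four packed 24-bit values (3 bytes apart,
 * little-endian) into the four dwords of an XMM register, then store them.
 * Assumption: reading 1 byte past each 24-bit value is safe. */
__attribute__((target("sse4.1")))
void gather4_u24(const uint8_t *base, const size_t idx[4], uint32_t out[4])
{
    /* Offsets are pre-multiplied by 3, so the address is byte-granular. */
    __m128i vec = _mm_cvtsi32_si128((int)load_u32_unaligned(base + idx[0] * 3));
    vec = _mm_insert_epi32(vec, (int)load_u32_unaligned(base + idx[1] * 3), 1);
    vec = _mm_insert_epi32(vec, (int)load_u32_unaligned(base + idx[2] * 3), 2);
    vec = _mm_insert_epi32(vec, (int)load_u32_unaligned(base + idx[3] * 3), 3);

    /* Clear the extra high byte that each 4-byte read dragged in. */
    vec = _mm_and_si128(vec, _mm_set1_epi32(0x00FFFFFF));
    _mm_storeu_si128((__m128i *)out, vec);
}
```

The final mask is only needed if the top byte of each dword must be zero; it can be dropped if later code ignores that byte.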
My inline asm attempt:
static inline __m128i __attribute__((always_inline))
_mm_insertu_epi32(__m128i a, void *b, long o, const int8_t imm8)
{
    __asm__("pinsrd %3, (%1, %2), %0" : "+x"(a) : "r"(b), "r"(o), "i"(imm8));
    return a;
}
Yields:
660f3a22041601 pinsrd $0x1, (%rsi,%rdx), %xmm0
Which is promising, but appears to completely confuse the optimizer; all of the surrounding code is perturbed beyond recognition.
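One thing worth noting about the attempt above: the asm statement reads memory, but it has neither an "m" input nor a "memory" clobber, so the compiler is not told about that read. A variant that passes the location as an "m" operand lets the compiler choose the addressing mode itself. This is a rough sketch, not a verified fix for the code-generation problem: u32_unaligned and INSERTU_EPI32 are hypothetical names, it assumes GCC or Clang on x86-64 with AT&T asm syntax, and the lane must be a compile-time constant.

```c
#include <stdint.h>
#include <emmintrin.h>

/* Hypothetical typedef: a uint32_t that may sit at any byte address and
 * may alias other types (GCC/Clang attribute extensions). */
typedef uint32_t u32_unaligned __attribute__((aligned(1), may_alias));

/* Hypothetical macro: insert the dword at ptr into lane `lane` of vec.
 * The "m" operand tells the compiler which memory is read and leaves the
 * choice of base/index/scale to it. `lane` must be a constant 0..3. */
#define INSERTU_EPI32(vec, ptr, lane)                         \
    __asm__("pinsrd %2, %1, %0"                               \
            : "+x"(vec)                                       \
            : "m"(*(const u32_unaligned *)(ptr)), "i"(lane))

/* Small demo wrapper: zero a vector, insert one unaligned dword into
 * lane 1, and store all four lanes to out. */
void insert_lane1_demo(const uint8_t *p, uint32_t out[4])
{
    __m128i vec = _mm_setzero_si128();
    INSERTU_EPI32(vec, p, 1);
    _mm_storeu_si128((__m128i *)out, vec);
}
```

A macro is used instead of a function so that the "i" constraint sees a literal even without optimization; the CPU still needs SSE4.1 at run time.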
Is there a way to do this without pure asm? (I'd like to use the intrinsic...)
See also: Dereference pointers in XMM register