pinsrd / _mm_insert_epi32 with byte pointer alignment?

Question

Similar to this question, I'd like to gather several 24-bit values into 32-bit dwords of an SSE/AVX register. Further:

each value is at a non-contiguous offset from a base pointer
each value's offset has only 1-byte alignment
I can ensure reading a vector beyond (or before) each value is safe

An AVX2 (performant?) gather solution is OK, but I also need pre-AVX support. It looks like pinsrd with the SIB byte indicating 1-byte alignment does exactly what I want, but I can't figure out how to get the compiler to emit this instruction encoding...

Using the standard intrinsic:

uint32_t *p = &base[offset];
vec = _mm_insert_epi32(vec, *p, 1);  // for each dword...

Yields reasonable encoding, assuming an aligned offset:

660f3a2244_b5_0001 pinsrd   $0x1, (%rbp,%rsi,4), %xmm0

But, I'd like to actually emit:

660f3a2244_35_0001 pinsrd   $0x1, (%rbp,%rsi), %xmm0

and manually pre-multiply offset by 3.

This encoding (tested via hex editing a linked binary) appears to work just fine. But... how can I emit it? No amount of type casting or attribute __align__ seems to work. The obvious approach:

uint8_t *p = &base[offset*3];
vec = _mm_insert_epi32(vec, *p, 1);

of course dereferences one byte with zero-extension to a dword before inserting.

My inline asm attempt:

static inline __m128i __attribute__((always_inline))
_mm_insertu_epi32(__m128i a, void *b, long o, const int8_t imm8)
{
    __asm__("pinsrd %3, (%1, %2), %0" : "+x"(a) : "r"(b), "r"(o), "i"(imm8));
    return a;
}

Yields:

660f3a22041601      pinsrd  $0x1, (%rsi,%rdx), %xmm0

Which is promising, but appears to completely confuse the optimizer; all of the surrounding code is perturbed beyond recognition.

Is there a way to do this without pure asm? (I'd like to use the intrinsic...)

See also: Dereference pointers in XMM register

Why exactly? This whole approach is mostly broken, if you just want to expand packed 24bit to 32bit, load a bunch of (pixels?) and shuffle them. — harold, Apr 29 '17 at 11:06
@harold, can you clarify "broken"? Do you mean "inefficient" or "functionally incorrect" (due to some alignment restriction on real-world CPUs)? As to why: simply to gather 24-bit values into a vector for further processing. Pre-AVX2, pinsrd appears to be the best instruction to load from memory with indexed addressing into an arbitrary position in a vector. But, unlike vpgatherdd, there is no way to directly control the index scale via the intrinsic? If there's a better (efficient, correct) way to gather from several unaligned offsets into vector elements, please educate me. — arekkusu, Apr 29 '17 at 19:29
@harold, they are not packed. That's what I meant by "each value is at a non-contiguous offset". So this is a gather. — arekkusu, Apr 29 '17 at 19:53
Fair enough. Still, the entire scale=1 thing is completely unnecessary, you can just put anything into the intrinsic and it'll work right. It's better to mix movd and pinsrd though, to avoid completely serializing the annoying high latency of pinsrd. So 2x movd, 2x pinsrd, combine the halves with an unpack. FWIW GCC does this automatically if you just _mm_set everything.. that's not something I usually recommend but in this case it sort of makes sense. Really this whole situation should be avoided in the first place, if at all possible. — harold, Apr 29 '17 at 20:04
Anyway I guess the main point is there is no alignment requirement, that inline asm attempt to force a particular encoding is not necessary. — harold, Apr 29 '17 at 20:07

score 0 · Accepted Answer · answered Apr 29 '17 at 22:16

@harold, thanks.

I was already doing movd followed by several pinsrd (like clang). But I see on godbolt that clang, gcc, and icc use various unpack patterns, so I'll profile them.
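One such unpack pattern can be sketched with SSE2 intrinsics alone. This is an illustrative sketch, not verified compiler output: the names gather32_unpack and load_u32_unaligned are hypothetical, and the four-movd variant shown here is an assumption that differs from the 2x movd plus 2x pinsrd mix harold describes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: 4-byte load from an address with any alignment. */
static inline int32_t load_u32_unaligned(const uint8_t *p)
{
    int32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical name. Element i of the result holds the dword read from
   base + o_i; the offsets are in bytes and need no particular alignment. */
static inline __m128i gather32_unpack(const uint8_t *base,
                                      long o0, long o1, long o2, long o3)
{
    /* Four independent scalar-to-vector moves (movd). */
    __m128i e0 = _mm_cvtsi32_si128(load_u32_unaligned(base + o0));
    __m128i e1 = _mm_cvtsi32_si128(load_u32_unaligned(base + o1));
    __m128i e2 = _mm_cvtsi32_si128(load_u32_unaligned(base + o2));
    __m128i e3 = _mm_cvtsi32_si128(load_u32_unaligned(base + o3));

    __m128i lo = _mm_unpacklo_epi32(e0, e1);  /* [e0, e1, 0, 0] */
    __m128i hi = _mm_unpacklo_epi32(e2, e3);  /* [e2, e3, 0, 0] */
    return _mm_unpacklo_epi64(lo, hi);        /* [e0, e1, e2, e3] */
}
```

Each element is loaded independently and merged with unpacks, so no insert waits on the result of a previous one. Whether that actually beats a pinsrd chain on a given CPU is something to profile.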

"Just avoid gather" isn't a solution, unfortunately. But you're right, the intrinsic does work with arbitrary alignment. Simple pointer casting ends up doing the right thing (that is, producing a possibly unaligned address):

__m128i gather32_scale4(int *b, long o0, long o1, long o2, long o3)
{
    return _mm_set_epi32(b[o0], b[o1], b[o2], b[o3]);
    //  movd    xmm0, dword ptr [rdi + 4*r8]
    //  pinsrd  xmm0, dword ptr [rdi + 4*rcx], 1
    //  pinsrd  xmm0, dword ptr [rdi + 4*rdx], 2
    //  pinsrd  xmm0, dword ptr [rdi + 4*rsi], 3
}

__m128i gather32_scale1(int *b, long o0, long o1, long o2, long o3)
{
    return _mm_set_epi32(
        *(int *)&((char *)b)[o0],
        *(int *)&((char *)b)[o1],
        *(int *)&((char *)b)[o2],
        *(int *)&((char *)b)[o3]);
    //  movd    xmm0, dword ptr [rdi + r8]
    //  pinsrd  xmm0, dword ptr [rdi + rcx], 1
    //  pinsrd  xmm0, dword ptr [rdi + rdx], 2
    //  pinsrd  xmm0, dword ptr [rdi + rsi], 3
}

(and similar for manually-written _mm_insert_epi32)
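The casts above dereference an int * that may be misaligned. x86 compilers tolerate this in practice, but C's alignment and aliasing rules do not guarantee it. A more conservative sketch, assuming little-endian x86 and that reading one byte past each 24-bit value is safe (as the question states), loads through memcpy and masks off the extra byte. The names load_dword_any_align and gather24_scale1 are hypothetical, and the element order here (index i0 lands in element 0) is a choice made for this sketch, not the order used in the functions above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read 4 bytes at p without assuming alignment. */
static inline int load_dword_any_align(const void *p)
{
    int v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical name. Gathers four 24-bit values located at byte offsets
   3*i0 .. 3*i3 from b. Each load reads 4 bytes, so the byte following
   each value must be readable. */
static inline __m128i gather24_scale1(const void *b,
                                      long i0, long i1, long i2, long i3)
{
    const char *base = (const char *)b;

    /* _mm_set_epi32 takes its arguments from the highest element down,
       so i3 is passed first in order for i0 to land in element 0. */
    __m128i v = _mm_set_epi32(load_dword_any_align(base + 3 * i3),
                              load_dword_any_align(base + 3 * i2),
                              load_dword_any_align(base + 3 * i1),
                              load_dword_any_align(base + 3 * i0));

    /* Keep the low 24 bits of each dword (little-endian layout assumed). */
    return _mm_and_si128(v, _mm_set1_epi32(0x00FFFFFF));
}
```

Compilers generally reduce the memcpy to a plain 4-byte load, so this form should not cost anything over the cast, though the generated code is worth checking for the target compiler.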
