
I'm optimizing a hot path in my codebase and have turned to vectorization. Keep in mind, I'm still quite new to all of this SIMD stuff. Here is the problem I'm trying to solve, implemented without SIMD:

inline int count_unique(int c1, int c2, int c3, int c4)
{
    return 4 - (c2 == c1)
             - ((c3 == c1) || (c3 == c2))
             - ((c4 == c1) || (c4 == c2) || (c4 == c3));
}

The assembly output after compiling with -O3:

count_unique:
        xor     eax, eax
        cmp     esi, edi
        mov     r8d, edx
        setne   al
        add     eax, 3
        cmp     edi, edx
        sete    dl
        cmp     esi, r8d
        sete    r9b
        or      edx, r9d
        movzx   edx, dl
        sub     eax, edx
        cmp     edi, ecx
        sete    dl
        cmp     r8d, ecx
        sete    dil
        or      edx, edi
        cmp     esi, ecx
        sete    cl
        or      edx, ecx
        movzx   edx, dl
        sub     eax, edx
        ret

How would something like this be done when storing c1, c2, c3, c4 as a 16-byte integer vector?

Cloud11665
  • In your real use-case, you don't have 4 "loose" scalars coming from different places, right? You can efficiently get them into a `__m128i` with a single `_mm_loadu_si128` or something. – Peter Cordes Apr 11 '22 at 16:30
  • SIMD hates interlane operations. You can rotate the vector to each of the four positions, do element-wise compares, sum the number of “equals”, and map that to the number of unique elements. (If I have figured correctly, you will get 0 “equal” results for 4 unique elements, 2 for 3, 4 or 6 for 2, and 12 for 1.) And there may be no very good solution. When faced with things like this in SIMD, the usual course is to ask what context this appears in, whether this part of the computation can be deferred to some other point after main SIMD work, or whether there is an alternative computation. – Eric Postpischil Apr 11 '22 at 16:33
  • (Illustration: Consider elements AAAB. Rotate to AABA and compare, getting 2 “equals”. Rotate to ABAA and compare to original, getting 2 more. Rotate to BAAA and compare, getting 2 more. Total 6. Consider ABAB. Rotate to BABA and compare, getting 0 “equals.” Rotate to ABAB and get 4. Rotate to BABA and get 0. Total 4.) – Eric Postpischil Apr 11 '22 at 16:37
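
Eric Postpischil's rotate-and-compare idea may be easier to follow in code. Here is a minimal sketch of it, assuming SSE2 intrinsics and a GCC/Clang-style __builtin_popcount; the _mm_setr_epi32 load in main is just one convenient way to get four loose ints into a __m128i, per Peter Cordes's comment above:

#include <emmintrin.h> // SSE2: _mm_shuffle_epi32, _mm_cmpeq_epi32, _mm_movemask_ps
#include <cstdio>

static inline int count_unique_sse2(__m128i v)
{
    // Compare v with its rotations by 1, 2 and 3 lanes; each matching lane
    // sets one bit in the corresponding movemask.
    __m128i r1 = _mm_shuffle_epi32(v, _MM_SHUFFLE(0, 3, 2, 1));
    __m128i r2 = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i r3 = _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 1, 0, 3));

    int pairs = __builtin_popcount(_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, r1))))
              + __builtin_popcount(_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, r2))))
              + __builtin_popcount(_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(v, r3))));

    // pairs counts ordered equal pairs: 0 -> 4 unique, 2 -> 3, 4 or 6 -> 2, 12 -> 1
    return 4 - (pairs > 0) - (pairs > 2) - (pairs > 6);
}

int main()
{
    __m128i v = _mm_setr_epi32(7, 3, 7, 9);  // {7, 3, 7, 9} has 3 unique values
    printf("%d\n", count_unique_sse2(v));    // prints 3
}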

2 Answers


For your simplified problem (testing all 4 lanes for equality), I would do it slightly differently: this way the complete test only takes 3 instructions.

#include <smmintrin.h> // SSE 4.1, for _mm_testz_si128

// True when the input vector has the same value in all 32-bit lanes
inline bool isSameValue( __m128i v )
{
    // Rotate vector by 4 bytes
    __m128i v2 = _mm_shuffle_epi32( v, _MM_SHUFFLE( 0, 3, 2, 1 ) );
    // The XOR outputs zero for equal bits, 1 for different bits
    __m128i xx = _mm_xor_si128( v, v2 );
    // Use PTEST instruction from SSE 4.1 set to test the complete vector for all zeros
    return (bool)_mm_testz_si128( xx, xx );
}
Soonts
  • With SSE4.1 available, clang even does that optimization for you. (It uses `psubd`, but `pxor` is just as good if not better on some CPUs.) https://godbolt.org/z/WdbhEd8cK. And yes, `pshufd` is a better choice than `palignr`: it can avoid a movdqa without AVX, and has smaller code size in bytes. – Peter Cordes Apr 11 '22 at 20:51

OK, I have "simplified" the problem, because the only case where I was using the unique count was when it was 1. But that is the same as checking whether all elements are the same, which can be done by comparing the input with itself shifted over by one element (4 bytes) using the _mm_alignr_epi8 function.

#include <tmmintrin.h> // SSSE3, for _mm_alignr_epi8

inline int is_same_val(__m128i v1) {
    // Rotate the vector by one 32-bit element (4 bytes)
    __m128i v2 = _mm_alignr_epi8(v1, v1, 4);
    // Compare each element against its rotated neighbour
    __m128i vcmp = _mm_cmpeq_epi32(v1, v2);
    // All 16 byte lanes equal -> movemask is 0xffff
    return ((uint16_t)_mm_movemask_epi8(vcmp) == 0xffff);
}
Cloud11665
  • `_mm_shuffle_epi32(v1, _MM_SHUFFLE(2,1,0,3))` is a more efficient way to rotate by 4 bytes. Also, you don't need to cast movemask to `(uint16_t)`; the compiler hopefully already knows the high 16 bits are zero, and if it doesn't you still don't want to force 16-bit operand-size or an extra `movzx` instruction. Actually, gcc does use `cmp ax,-1` but it's not bad, only 4 bytes without an LCP stall, since `-1` fits in an imm8. Either way, clang interestingly optimizes into a `psubd` / `ptest` / `sete` to check that the whole register is zero. https://godbolt.org/z/WdbhEd8cK – Peter Cordes Apr 11 '22 at 20:48
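
A minimal sketch with those comment suggestions folded in (pshufd instead of palignr, no uint16_t cast); it assumes only SSE2, and the is_same_val_v2 name is just for illustration:

#include <emmintrin.h> // SSE2: _mm_shuffle_epi32, _mm_cmpeq_epi32, _mm_movemask_epi8

inline int is_same_val_v2(__m128i v1) {
    // Rotate by one 32-bit lane with pshufd instead of palignr
    __m128i v2 = _mm_shuffle_epi32(v1, _MM_SHUFFLE(2, 1, 0, 3));
    __m128i vcmp = _mm_cmpeq_epi32(v1, v2);
    // If all four lanes matched, every byte of the movemask is set
    return _mm_movemask_epi8(vcmp) == 0xffff;
}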