
I have several __m128i vectors containing 32-bit unsigned integers, and I would like to check whether any of the 4 integers in a vector is zero.

I understand how I can "aggregate" the multiple __m128i vectors but eventually I will still end up with a single __m128i vector, which I will then need to check horizontally.

How do I perform the final horizontal check for zero across the last vector?

EDIT: I am using Intel intrinsics, not inline assembly.

user997112

1 Answer


Don’t do it. Avoid horizontal operations whenever possible; they are death to the performance of vector code.

Instead, compare the vector to a vector of zeros, then use PMOVMSKB (the _mm_movemask_epi8 intrinsic) to get a mask in a general-purpose register (GPR), i.e. an ordinary int. If that mask is non-zero, at least one of the lanes of your vector was zero:

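__m128i yourVector;  // the vector to test (assumed to be loaded elsewhere)
__m128i zeroVector = _mm_setzero_si128();

// Lanes equal to zero become all-ones; movemask gathers the byte sign bits into an int.
if (_mm_movemask_epi8(_mm_cmpeq_epi32(yourVector, zeroVector))) {
    // at least one lane of your vector is zero.
}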

You can also use PTEST if you want to assume SSE4.1.
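A minimal sketch of the PTEST variant, assuming SSE4.1 is available on the target CPU. The helper name any_lane_zero_ptest is hypothetical. The target attribute is a GCC/Clang extension used here so that this one function can use SSE4.1 without a global -msse4.1 flag; other compilers may need a different arrangement.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 (PTEST) */

/* Hypothetical helper: returns 1 if any 32-bit lane of v is zero, else 0. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))  /* assumption: GCC or Clang */
#endif
static int any_lane_zero_ptest(__m128i v)
{
    /* Lanes of v that equal zero become all-ones; all others become zero. */
    __m128i eq = _mm_cmpeq_epi32(v, _mm_setzero_si128());

    /* _mm_testz_si128(a, b) returns 1 when (a & b) is all zero, so testing
       eq against itself and negating answers "is any compare lane set?". */
    return !_mm_testz_si128(eq, eq);
}
```

The compare is still needed: PTEST on its own only reports whether the whole register is zero, so it is applied to the compare result rather than to the original vector.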


Taking the question at face value, if you really did need to do a horizontal AND for some reason, it would be movhlps + andps + shufps + andps. But don’t do that.
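For completeness, a rough intrinsics sketch of that horizontal AND, assuming only SSE2. It uses integer shuffles (_mm_shuffle_epi32) in place of the movhlps/shufps named above, which is a substitution made here for convenience with __m128i; the helper name horizontal_and_epu32 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: ANDs the four 32-bit lanes of v together. */
static uint32_t horizontal_and_epu32(__m128i v)
{
    /* Swap the upper and lower 64-bit halves: lanes become (2, 3, 0, 1). */
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i t  = _mm_and_si128(v, hi);   /* lane 0 = v0 & v2, lane 1 = v1 & v3 */

    /* Swap adjacent lanes so lane 0 sees the other partial result. */
    __m128i s  = _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i r  = _mm_and_si128(t, s);    /* lane 0 = v0 & v1 & v2 & v3 */

    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

Note that ANDing the raw values is not by itself a test for a zero lane (1 & 2 is 0 although neither lane is zero), so this only illustrates the reduction pattern.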

Stephen Canon
  • (PTEST isn't useful here since it will tell you whether *all* the lanes are zero, rather than whether *any* lane is zero.) – Raymond Chen Apr 21 '14 at 21:41
  • @RaymondChen But that can be easily inverted. – Mysticial Apr 21 '14 at 21:41
  • @RaymondChen: as Mysticial notes, any lane zero is the same as !(all lanes non-zero), which PTEST can do. – Stephen Canon Apr 21 '14 at 21:42
  • Two questions. One: could you elaborate on your solution? I am using intrinsics and have no idea what GPR is. Two: could this approach of avoiding horizontal operations be applied to summation? I am summing across an array and I use multiple __m128i vectors. Each vector contains 4 "mini-sums", but eventually I need one sum value. I cannot see how I could end up with one sum value unless I do a horizontal summation at the end. – user997112 Apr 21 '14 at 21:46
  • @user997112 No. Summing will require actually adding them up. There will not be horizontal reduction instructions for addition until AVX512. – Mysticial Apr 21 '14 at 21:48
  • @user997112: What Mysticial said. If you need to add, you need to add. Checking for zero is a much simpler operation. (But: do you really need to do horizontal summation? Is there some way you could modify your data layout to avoid doing it?) – Stephen Canon Apr 21 '14 at 21:50
  • @StephenCanon It sounds like this is a reduction of a larger vector. At the end you'll still have to reduce over a single vector. But in that case, it's probably not performance critical because it's O(1) of an O(N) operation. – Mysticial Apr 21 '14 at 21:51
  • @Mysticial: agreed; I just like to make sure. People tend to be horizontal-operation happy when they’re starting out writing vector code. – Stephen Canon Apr 21 '14 at 21:54
  • @StephenCanon Horizontal-happy is still better than set-happy: http://stackoverflow.com/a/23186488/922184 :D – Mysticial Apr 21 '14 at 21:56
  • @Mysticial: One would like to think that compilers would manage to optimize that particular horror away. One would like to think. – Stephen Canon Apr 21 '14 at 22:07
  • @StephenCanon If I write horizontal-vector code, it's always after the loop, once all the parallel processing has been done and I need to aggregate the results. – user997112 Apr 22 '14 at 01:31
  • @user997112: that's a perfectly appropriate usage of a horizontal operation. – Stephen Canon Apr 24 '14 at 01:54
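
The post-loop reduction discussed in the comments above (four 32-bit "mini-sums" folded into one value) might look like the following. This is a sketch assuming SSE2 only; the helper name horizontal_sum_epu32 is hypothetical, and the additions wrap modulo 2^32 as usual for unsigned arithmetic.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: adds the four 32-bit lanes of v together. */
static uint32_t horizontal_sum_epu32(__m128i v)
{
    /* Swap the upper and lower 64-bit halves and add:
       lane 0 = v0 + v2, lane 1 = v1 + v3. */
    __m128i hi = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
    __m128i t  = _mm_add_epi32(v, hi);

    /* Swap adjacent lanes and add again: lane 0 = v0 + v1 + v2 + v3. */
    __m128i s  = _mm_shuffle_epi32(t, _MM_SHUFFLE(2, 3, 0, 1));
    __m128i r  = _mm_add_epi32(t, s);

    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

Called once after the loop, this costs a fixed handful of instructions regardless of the array length, which is the O(1)-of-O(N) point made in the thread.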