
Does there exist a quick way to check whether a SIMD vector is a zero vector (all components equal ±zero)? I am currently using an algorithm based on shifts that runs in log2(N) time, where N is the dimension of the vector. Does there exist anything faster? Note that my question is broader (see the tags) than the proposed answer: it refers to vectors of all types (integer, float, double, ...).

user1095108
  • possible duplicate of [Most efficient way to check if all __m128i components are 0 [using SSE intrinsics]](http://stackoverflow.com/questions/27905677/most-efficient-way-to-check-if-all-m128i-components-are-0-using-sse-intrinsic) – Iwillnotexist Idonotexist Mar 14 '15 at 17:22
  • Depends on the instructions that are available; tagging neon and avx together, I don't know what you're up to. – harold Mar 14 '15 at 18:54
  • @harold a list/table of intrinsics or ideas for doing this very common operation. If the question is too broad, I'll delete. – user1095108 Mar 14 '15 at 19:22
  • Well, in the most general case you're probably stuck with log(n); this goes all the way down to implementing an "is zero" instruction in hardware, where your OR tree (OR pairs together until you have 1 bit that is 0 if and only if all input bits were 0) would be log(n) layers deep. So, unless we can shift the goalposts a little, doing better is actually impossible. – harold Mar 14 '15 at 21:55
  • If `N` is large, then the obvious solution is to process the elements in groups of `W`, where `W` is the largest block you can handle by ORing them together. Once you reach the end, you have options: 1) `ptest` on SSE4.1+; 2) compare with zero, then `pmovmskb` on SSE2+ for integer types, or `movmskps`/`movmskpd` for single/double precision. If N is especially large, the final reduction won't be your bottleneck; it will be the streaming of data to be ORed into your registers. For NEON on ARMv7, the final reduction can be done with pairwise unsigned `max`es until 32-bit word size is reached (a sketch of that reduction follows these comments). – Iwillnotexist Idonotexist Mar 14 '15 at 22:25
  • @IwillnotexistIdonotexist Did you take into account the signed zero? I think you didn't. – user1095108 Mar 14 '15 at 23:36
  • @harold I don't think the OR tree works with signed zero float. – user1095108 Mar 14 '15 at 23:46
  • @user1095108 In fact, I did, and @harold did too. If the only values in the vector are +0, the reduction result will be identically +0, and if there are only +0 and -0, then the OR-reduction will generate -0. And surely you must know that IEEE mandates that -0 compare equal to 0, so when you do that comparison for == 0 before the `movmskps`, you will correctly ignore the sign of zero. – Iwillnotexist Idonotexist Mar 15 '15 at 03:07
  • @user1095108 that's a trivial extension of the same principle – harold Mar 15 '15 at 09:32
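
To make the NEON reduction described in the comments concrete, here is a minimal sketch. It is only an illustration under stated assumptions: ARMv7 NEON, a `float` array whose length is a multiple of 4, ±0.0f treated as zero, and a hypothetical helper name `all_zero_neon`.

#include <arm_neon.h>

/* Hypothetical helper: returns 1 if every float in buf is +0.0f or -0.0f.
   Assumes n is a multiple of 4; a scalar tail loop would handle the rest. */
int all_zero_neon(const float *buf, int n) {
    uint32x4_t acc = vdupq_n_u32(0);
    for (int i = 0; i < n; i += 4) {
        /* OR the raw bit patterns together; any non-zero bit survives. */
        acc = vorrq_u32(acc, vreinterpretq_u32_f32(vld1q_f32(buf + i)));
    }
    /* Clear the sign bit of each lane so -0.0f (0x80000000) counts as zero. */
    acc = vandq_u32(acc, vdupq_n_u32(0x7FFFFFFFu));
    /* Final reduction with pairwise unsigned maxes, as described above. */
    uint32x2_t r = vpmax_u32(vget_low_u32(acc), vget_high_u32(acc));
    r = vpmax_u32(r, r);
    return vget_lane_u32(r, 0) == 0;
}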

2 Answers


How about this straightforward AVX code? I think it's O(N), and I don't know how you could possibly do better without making assumptions about the input data: you have to actually read every value to know whether it's 0, so it's about doing as much of that as possible per cycle.

You should be able to massage the code to your needs. It treats both +0 and -0 as zero. It will work for unaligned memory addresses, but aligning to 32-byte addresses will make the loads faster. You may need to add something to deal with the remaining elements if `size` isn't a multiple of 8.

#include <immintrin.h>
#include <stdint.h>

uint64_t num_non_zero_floats(float *mem_address, int size) {
    uint64_t num_non_zero = 0;
    __m256 zeros = _mm256_setzero_ps();
    for (int i = 0; i != size; i += 8) {
        __m256 vec = _mm256_loadu_ps(mem_address + i);
        // all-ones in lanes where vec != 0; -0.0 compares equal to +0.0, so it counts as zero
        __m256 comparison_out = _mm256_cmp_ps(vec, zeros, _CMP_NEQ_UQ); // 3 cycles latency, throughput 1
        uint64_t bits_non_zero = _mm256_movemask_ps(comparison_out); // 2-3 cycles latency
        num_non_zero += __builtin_popcountll(bits_non_zero);
    }
    return num_non_zero;
}
Hal
  • I think the __builtin_popcountll is irrelevant for determining zero. – user1095108 Apr 22 '15 at 15:37
  • yep, just test `bits_non_zero`, and of course you can drop out of the loop to stop unnecessary processing as soon as it isn't zero, if you want that optimisation. The code actually counts the number of floats in the vector that are non-zero, as its name suggests - hence the "you should be able to massage the code to your needs" – Hal Apr 23 '15 at 02:41
  • @Hal This is unlikely to be faster than just loading the data as `__m256i`, ORing into two separate accumulators, then at the end, ORing the accumulators together, casting to `__m256`, doing `_mm256_cmp_ps()` and performing `_mm256_movemask_ps()` or a `vptest`. That being said, a neat trick you could use to avoid having to do the mask move and popcount in the inner loop is to cast the compare result to integer, and _subtract_ it from an accumulator. If the compare result is all ones (`== -1`), subtracting it from an accumulator is adding `+1` to it. If it's 0, subtracting will have no effect. – Iwillnotexist Idonotexist May 14 '15 at 03:27
  • @Hal Briefly, `__m256i acc = _mm256_setzero_si256();/* In the loop... */ acc = _mm256_sub_epi32(acc, _mm256_castps_si256(comparison_out));`. And then at the end of the loop you sum up the 8 accumulators, and you never have to go through popcount. And this works in systems with SSE, and systems without popcount (a sketch of this trick follows below). – Iwillnotexist Idonotexist May 14 '15 at 03:31
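
To illustrate the subtract-the-compare-mask trick from the comment above, here is a rough sketch, not taken from the thread. It assumes AVX2 (for `_mm256_sub_epi32`; on AVX-only hardware the same idea works with two 128-bit `_mm_sub_epi32` accumulators), that `size` is a multiple of 8, and it uses a NEQ compare so the function counts non-zero lanes directly; the name `num_non_zero_floats_acc` is hypothetical.

#include <immintrin.h>
#include <stdint.h>

uint64_t num_non_zero_floats_acc(const float *mem_address, int size) {
    __m256  zeros = _mm256_setzero_ps();
    __m256i acc   = _mm256_setzero_si256();
    for (int i = 0; i != size; i += 8) {
        __m256 vec = _mm256_loadu_ps(mem_address + i);
        /* All-ones (-1) in lanes where vec != 0, zero elsewhere. */
        __m256i mask = _mm256_castps_si256(_mm256_cmp_ps(vec, zeros, _CMP_NEQ_UQ));
        /* Subtracting -1 adds 1, so each of the 8 lanes counts its own hits. */
        acc = _mm256_sub_epi32(acc, mask);
    }
    /* Sum the 8 per-lane counters at the end, outside the hot loop. */
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    uint64_t num_non_zero = 0;
    for (int j = 0; j < 8; ++j) num_non_zero += lanes[j];
    return num_non_zero;
}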

If you want to test floats for +/- 0.0, then you can check for all the bits being zero, except the sign bit. Any set bit anywhere except the sign bit means the float is non-zero. (http://www.h-schmidt.net/FloatConverter/IEEE754.html)
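
For a single float, that test is just a bitwise AND against `0x7FFFFFFF`. A minimal sketch (the helper name is mine, assuming the usual 32-bit IEEE-754 `float` layout):

#include <stdint.h>
#include <string.h>

/* 1 if f is +0.0f or -0.0f: clear the sign bit, then check the rest is zero. */
int float_is_zero(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy the raw IEEE-754 bit pattern */
    return (bits & 0x7FFFFFFFu) == 0;
}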


Agner Fog's asm optimization guide points out that you can test a float or double for zero using integer instructions:

; Example 17.4b
mov  eax, [rsi]
add  eax, eax   ; shift out the sign bit
jz   IsZero

For vectors, though, using ptest with a sign-bit mask is better than using paddd to get rid of the sign bit. Actually, `test dword [rsi], 0x7fffffff` may be more efficient than Agner Fog's load/add sequence, but the 32-bit immediate probably stops the load from micro-fusing on Intel, and it may have a larger code-size.


x86 PTEST (SSE4.1) does a bitwise AND and sets flags based on the result.

movdqa xmm0, [mask]
.loop:
ptest  xmm0, [rsi+rcx]
jnz    nonzero
add    rcx, 16  ; count up towards zero
jl     .loop    ; with rsi pointing to past the end of the array
...
nonzero:

Or cmov could be useful to consume the flags set by ptest.

IDK if it'd be possible to use a loop-counter instruction that didn't set the zero flag, so you could do both tests with one jump instruction or something. Probably not. And the extra uop to merge the flags (or the partial-flags stall on earlier CPUs) would cancel out the benefit.

@Iwillnotexist Idonotexist: re one of your comments on the OP: you can't just movemask without doing a pcmpeq first, or a cmpps. The non-zero bit might not be in the high bit! You probably knew that, but one of your comments seemed to leave it out.

I do like the idea of ORing together multiple values before actually testing. You're right that sign-bits would OR with other sign-bits, and then you ignore them the same way you would if you were testing one at a time. A loop that PORs 4 or 8 vectors before each PTEST would probably be faster. (PTEST is 2 uops, and can't macro-fuse with a jcc.)
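
A rough intrinsics version of that OR-then-PTEST loop, as a sketch rather than a drop-in implementation: it assumes SSE4.1, floats (so the per-lane mask of everything except the sign bit is `0x7FFFFFFF`), and a length that is a multiple of 16 floats; `all_zero_sse41` is a hypothetical name.

#include <smmintrin.h>   /* SSE4.1: _mm_testz_si128 compiles to ptest */

/* Hypothetical helper: 1 if every float in buf is +/-0.0f, else 0. */
int all_zero_sse41(const float *buf, int n) {
    /* Keeps every bit except each lane's sign bit, so -0.0f is ignored. */
    const __m128i nonsign_mask = _mm_set1_epi32(0x7FFFFFFF);
    for (int i = 0; i < n; i += 16) {
        /* OR four vectors together before a single PTEST. */
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(buf + i + 4)));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(buf + i + 8)));
        v = _mm_or_si128(v, _mm_loadu_si128((const __m128i *)(buf + i + 12)));
        /* ZF is set iff (v AND nonsign_mask) == 0, i.e. only sign bits were set. */
        if (!_mm_testz_si128(v, nonsign_mask))
            return 0;
    }
    return 1;
}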

Peter Cordes