
I am using the SIMD (Vector) API in Java:

// both `buffer` and `markVector` are ByteVector
var result = buffer.and(markVector);

I need to check efficiently whether all bits in result are 0.

One workaround is to convert it to a byte[], then check each byte against 0 one by one. But this approach does not take advantage of SIMD.

Is there a way to check whether all bits of a ByteVector are 0 using SIMD?

chenzhongpu
  • I don't know the Java API, but the optimal strategy is very different for x86 vs. ARM, and even 32-bit ARM and NEON are different enough that I think a different strategy is optimal. On x86 with SSE4.1 `ptest xmm0,xmm0` to set ZF in FLAGS, i.e. `_mm_test_all_zeros` or AVX `_mm256_testz_si256`. Or in your case, you're testing a bitwise AND, so you'd just use `ptest` directly between two different inputs. Without SSE4.1, SSE2 `pcmpeqb` against a zeroed register then `pmovmskb eax, xmm0` to get a scalar integer bitmap of the packed-compare result. – Peter Cordes Mar 06 '23 at 07:15
  • ARM SIMD doesn't have a `pmovmskb` equivalent or a way to set condition-codes for branching based on vector instructions. But AArch64 has a right-shift-and-insert or something that can narrow a packed-compare result from 128 to 64 bits, same width as an integer register. And I think most ARMv8 CPUs don't stall when moving data from SIMD to integer regs, unlike some 32-bit ARM CPUs. For 32-bit, IIRC your best bet could be to OR the two `d` registers that make up a 128-bit `q` register, then reduce to 32-bit with a horizontal add or OR or something. – Peter Cordes Mar 06 '23 at 07:16
  • So anyway, to JIT to efficient code on different ISAs, the Java API hopefully can do something high-enough level that you don't have to pick one of those strategies and try to express the details in a portable API, because the horizontal reduction down to 32-bit would be a lot less efficient on x86-64. – Peter Cordes Mar 06 '23 at 07:21
  • I am new to SIMD. Java's `Vector` API (of course still in incubator stage) aims to provide architecture independent abstractions, so using the lower level or CPU specific API (e.g., `NEON` or `AVX`) is not my option. – chenzhongpu Mar 06 '23 at 07:29
  • I wasn't suggesting writing Java source using CPU-specific intrinsics. See my last comment for the point of all that, that if there wasn't a high-level thing like `.allTrue()`, you would have been forced to implement that yourself out of whatever other operations it provides. So you'd have had to pick an implementation strategy. So it's a good thing there is a `.allTrue()` to let the JIT use a good strategy for the ISA it's running on. And it's important to use it instead of rolling your own, since there *are* ISA-specific tricks that a good JIT will use. – Peter Cordes Mar 06 '23 at 07:32
  • @PeterCordes Thanks. Your comments are very inspiring for me :) – chenzhongpu Mar 06 '23 at 07:33

1 Answer


After searching the Javadoc, I found a feasible solution using eq(), which produces a VectorMask. The documentation describes it as follows:

A VectorMask represents an ordered immutable sequence of boolean values.

var result = buffer.and(markVector);

if (result.eq((byte) 0x00).allTrue())
   System.out.println("All Zeros");
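
For context, here is a minimal sketch of how the same check might look when applied across whole arrays. The class name ZeroCheck, the method andIsAllZero, and the array-based wrapper are assumptions made for illustration and are not part of the original question. While the Vector API is still incubating, it has to be compiled and run with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical helper class, for illustration only.
public class ZeroCheck {
    // Preferred species: the vector shape the runtime considers best for this CPU.
    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    /** Returns true if (data[i] & mask[i]) == 0 for every index i. */
    static boolean andIsAllZero(byte[] data, byte[] mask) {
        if (data.length != mask.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int i = 0;
        int upper = SPECIES.loopBound(data.length);
        // Vectorized part: one AND plus one compare-and-test per chunk.
        for (; i < upper; i += SPECIES.length()) {
            ByteVector buffer = ByteVector.fromArray(SPECIES, data, i);
            ByteVector markVector = ByteVector.fromArray(SPECIES, mask, i);
            ByteVector result = buffer.and(markVector);
            if (!result.eq((byte) 0).allTrue()) {
                return false;
            }
        }
        // Scalar tail for the elements that do not fill a whole vector.
        for (; i < data.length; i++) {
            if ((data[i] & mask[i]) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        byte[] mask = new byte[100];
        java.util.Arrays.fill(data, (byte) 0x0F);
        java.util.Arrays.fill(mask, (byte) 0xF0);
        System.out.println(andIsAllZero(data, mask)); // true: no overlapping bits
        mask[57] = (byte) 0x01;
        System.out.println(andIsAllZero(data, mask)); // false: bit 0 overlaps at index 57
    }
}
```

Keeping the test as eq(...).allTrue() rather than a hand-written lane reduction leaves the choice of instruction sequence to the JIT, which is the point made in the comments on the question.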
chenzhongpu