I have a device function that checks a byte array using threads: each thread checks a different byte in the array for a certain value and returns true or false.
How can I efficiently decide whether all of the checks returned true?
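// returns true if predicate is true for all threads in a block
__device__ bool unanimous(bool predicate) { ... }
__device__ bool all_the_same(unsigned char* bytes, unsigned char value, int n) {
    // threads past the end of the array vote true so they cannot veto the result
    bool ok = (threadIdx.x >= (unsigned int)n) || (bytes[threadIdx.x] == value);
    return unanimous(ok);
}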
The implementation of unanimous() depends on the compute capability of your hardware. For compute capability 2.0 or higher devices, it is trivial:
__device__ bool unanimous(bool predicate) { return __syncthreads_and(predicate); }
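To try this end to end, here is a minimal sketch of a single-block test harness. The names check_kernel and run_check are hypothetical, not part of the original answer. It assumes the array has between 1 and 1024 bytes so that one block covers it, and it omits CUDA error checking for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Block-wide AND of a per-thread predicate (compute capability 2.0+).
__device__ bool unanimous(bool predicate) {
    return __syncthreads_and(predicate) != 0;
}

// Hypothetical test kernel: one block, one thread per byte.
// Threads past the end of the array vote true so they cannot veto the result.
__global__ void check_kernel(const unsigned char* bytes, unsigned char value,
                             int n, int* result) {
    bool ok = (threadIdx.x >= (unsigned int)n) || (bytes[threadIdx.x] == value);
    bool all_ok = unanimous(ok);  // every thread in the block must reach this call
    if (threadIdx.x == 0) *result = all_ok ? 1 : 0;
}

// Hypothetical host helper: returns true if every byte of data equals value.
// Assumption: 1 <= data.size() <= 1024, so a single block is enough.
bool run_check(const std::vector<unsigned char>& data, unsigned char value) {
    int n = static_cast<int>(data.size());
    assert(n >= 1 && n <= 1024);

    unsigned char* d_bytes = nullptr;
    int* d_result = nullptr;
    int h_result = 0;
    cudaMalloc(&d_bytes, n);
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_bytes, data.data(), n, cudaMemcpyHostToDevice);

    int threads = ((n + 31) / 32) * 32;  // round up to whole warps
    check_kernel<<<1, threads>>>(d_bytes, value, n, d_result);

    // this blocking copy also waits for the kernel to finish
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bytes);
    cudaFree(d_result);
    return h_result != 0;
}
```

Calling run_check on a vector filled with a single value should return true, and changing any one byte should make it return false.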
For compute capability 1.0 and 1.1 devices, you will need to implement an AND reduction in shared memory (exercise for the reader, since it's well documented). For the special case of compute capability 1.3 (warp vote functions are available from compute capability 1.2 onward), you can optimize the AND reduction using warp vote instructions, via the __all() intrinsic function provided in the CUDA headers.
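For the plain shared-memory case, one possible sketch of the AND reduction is shown below. It is an illustration under stated assumptions rather than a tuned implementation: it assumes a 1D block of at most 512 threads, and MAX_THREADS, reduce_kernel and run_check_reduction are hypothetical names introduced here. The loop halves the number of active slots each step and also handles block sizes that are not a power of two.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define MAX_THREADS 512  // hypothetical upper bound on blockDim.x

// Block-wide AND without warp vote or __syncthreads_and.
__device__ bool unanimous(bool predicate) {
    __shared__ int votes[MAX_THREADS];
    votes[threadIdx.x] = predicate ? 1 : 0;
    __syncthreads();

    // Tree reduction: fold the upper part of the active range onto the lower part.
    // The loop bounds depend only on blockDim.x, so every thread reaches every barrier.
    for (unsigned int active = blockDim.x; active > 1; ) {
        unsigned int half = (active + 1) / 2;
        if (threadIdx.x + half < active)
            votes[threadIdx.x] &= votes[threadIdx.x + half];
        __syncthreads();
        active = half;
    }

    bool result = votes[0] != 0;
    __syncthreads();  // keep a later call from overwriting votes[0] too early
    return result;
}

// Hypothetical test kernel: one block, one thread per byte.
__global__ void reduce_kernel(const unsigned char* bytes, unsigned char value,
                              int* result) {
    bool all_ok = unanimous(bytes[threadIdx.x] == value);
    if (threadIdx.x == 0) *result = all_ok ? 1 : 0;
}

// Hypothetical host helper; assumes 1 <= data.size() <= MAX_THREADS.
// Error checking is omitted for brevity.
bool run_check_reduction(const std::vector<unsigned char>& data,
                         unsigned char value) {
    int n = static_cast<int>(data.size());
    assert(n >= 1 && n <= MAX_THREADS);

    unsigned char* d_bytes = nullptr;
    int* d_result = nullptr;
    int h_result = 0;
    cudaMalloc(&d_bytes, n);
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_bytes, data.data(), n, cudaMemcpyHostToDevice);

    reduce_kernel<<<1, n>>>(d_bytes, value, d_result);  // one thread per byte

    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_bytes);
    cudaFree(d_result);
    return h_result != 0;
}
```

The reduction takes about log2(blockDim.x) barrier steps, which is why the warp vote variant is attractive where the hardware supports it.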
edit:
OK, since gamerx is asking in the comments: on sm_13 hardware, you can do this.
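// returns true if predicate is true for all threads in a block
// note: supports maximum of 1024 threads (32 warps) in a 1D block as written
__device__ bool unanimous(bool predicate) {
    __shared__ bool warp_votes[32];
    // default every slot to true so that unused warp slots cannot veto the result
    if (threadIdx.x < warpSize) warp_votes[threadIdx.x] = true;
    __syncthreads();  // initialization must finish before any warp stores its vote
    // each warp records its own vote (all lanes write the same value)
    warp_votes[threadIdx.x / warpSize] = __all(predicate);
    __syncthreads();
    // the first warp combines the per-warp votes into slot 0
    if (threadIdx.x < warpSize) warp_votes[0] = __all(warp_votes[threadIdx.x]);
    __syncthreads();
    return warp_votes[0];
}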