Optimizing a threaded simultaneous check

Question

I have a device function that checks a byte array using threads, each thread checking a different byte in the array for a certain value and returns bool true or false.

How can I efficiently decide if all the checks have returned true or otherwise?

CUDA has warp voting functions that can be used to construct a fairly efficient "any"/"all"/"none" type binary reduction *at the block level*. You probably can't inspect the results *all* checks across the entire grid within a running kernel, because it requires synchronisation across the whole grid. A second kernel launch or a small host side reduction would be necessary to get the state across the whole grid. — talonmies, Jul 01 '12 at 15:44
Thanks, I'll go look up on the voting function. Anyway, I'm not trying to check across grids, just within a block. — gamerx, Jul 01 '12 at 15:58
It seems the warp vote as the name suggests only allow you to check within a warp, however I have more than the warp size number of threads running. Can it still be used? — gamerx, Jul 01 '12 at 16:04
An easy implementation would be to have a single thread in each warp "or" the warp's result onto a value in shared memory and then have a single thread in the block check that value in the end. — Roger Dahl, Jul 01 '12 at 16:13
@gamerx: Of course. Use warp voting to reduce each warp to a single value, then reduce the per warp values to a per block and write that out to global memory. It is just a slightly modified version of the standard shared memory reduction. — talonmies, Jul 01 '12 at 16:14
@ talonmies Do you mind giving me an example of how to perform the warp voting reduction? I have a function __ device__ bool checkByte() that I run in the kernel. The functions checks through 40 bytes simultenously(1 thread/byte). Since each warp is a collection of 32 threads, something like: if(__all(checkBytes())) will only check the first 32 threads, how do use voting on the remaining ones? Sorry if this seems very obvious to you guys, I'm new to cuda and in fact any kind of parallel programming. Thanks again. — gamerx, Jul 02 '12 at 13:16
I provided a complete function using `__all()` in my answer below. — harrism, Jul 04 '12 at 01:14

harrism · Accepted Answer · 2012-07-03T12:16:42.747

// returns true if predicate is true for all threads in a block
__device__ bool unanimous(bool predicate) { ... }

__device__ bool all_the_same(unsigned char* bytes, unsigned char value, int n) {
    return unanimous(bytes[threadIdx.x] == value);
}

The implementation of unanimous() depends on the compute capability of your hardware. For compute capability 2.0 or higher devices, it is trivial:

__device__ bool unanimous(bool predicate) { return __syncthreads_and(predicate); }

For compute capability 1.0 and 1.1 devices, you will need to implement an AND reduction (exercise for the reader, since it's well documented). For the special case of compute capability 1.3, you can optimize the AND reduction using warp vote instructions, using the __all() intrinsic function provided in the CUDA headers.

edit:

OK, since gamerx is asking in the comments. On sm_13 hardware, you can do this.

// returns true if predicate is true for all threads in a block
// note: supports maximum of 1024 threads in block as written
__device__ bool unanimous(bool predicate) {
    __shared__ bool warp_votes[32];
    if (threadIdx.x < warpSize) warp_votes[threadIdx.x] = true;
    warp_votes[threadIdx.x / warpSize] = __all(pred);
    __syncthreads();
    if (threadIdx.x < warpSize) warp_votes[0] = __all(warp_votes[threadIdx.x];
    __syncthreads();
    return warp_votes[0];
}

Optimizing a threaded simultaneous check

1 Answers1