10

The CUDA programming guide introduced the concept of warp vote function, "_all", "_any" and "__ballot".

My question is: what applications will use these 3 functions?

Fan Zhang
  • 609
  • 1
  • 9
  • 17

4 Answers4

8

The prototype of __ballot is the following

unsigned int __ballot(int predicate);

If predicate is nonzero, __ballot returns a value with the Nth bit set, where N is the thread index.

Combined with atomicOr and __popc, it can be used to accumulate the number of threads in each warp having a true predicate.

Indeed, the prototype of atomicOr is

int atomicOr(int* address, int val);

and atomicOr reads the value pointed to by address, performs a bitwise OR operation with val, and writes the value back to address and returns its old value as a return parameter.

On the other side, __popc returns the number of bits set withing a 32-bit parameter.

Accordingly, the instructions

volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];

const u32 warp_sum = threadIdx.x >> 5;

atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold));

atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));

can be used to count the number of threads for which the predicate is true.

For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann

Vitality
  • 20,705
  • 4
  • 108
  • 146
  • 1
    beautiful, the documentation on ballot and its successor is very sparse. This gave me some intuition as to what `__ballot_sync` is doing. Thanks @jackolantern! – interestedparty333 Mar 18 '19 at 19:48
5

__ballot is used in CUDA-histogram and in CUDA NPP library for quick generation of bitmasks, and combining it with __popc intrinsic to make a very efficient implementation of boolean reduction.

__all and __any was used in reduction before introduction of __ballot, though I can not think of any other use of them.

aland
  • 4,829
  • 2
  • 24
  • 42
1

As an example of algorithm that uses __ballot API i would mention the In-Kernel Stream Compaction by D.M Hughes et Al. It is used in prefix sum part of the stream compaction to count (per warp) the number of elements that passed the predicate.

Here the paper. In-k Stream Compaction

Community
  • 1
  • 1
Davide Spataro
  • 7,319
  • 1
  • 24
  • 36
  • This sounds super interesting. Is there any implementation I can look at? – aatish Jan 26 '16 at 21:10
  • yes, I wrote an improved version of that algorithm. https://github.com/knotman90/cuStreamComp . Please ask me if you need clarifications or benchmarks. – Davide Spataro Jan 26 '16 at 21:15
  • Actually it would be nice if you have some benchmarks against thrust library. Also I think that in line 78 of cuCompactor.cuh, it should be possible to have another global array called d_output_index which would contain the the value of idx corresponding to where the original data came from. Am I correct? – aatish Jan 26 '16 at 21:35
  • 1
    Also, I think you should add a LICENSE.txt file to your repo. – aatish Jan 26 '16 at 21:40
  • write me on github or visit davidespataro.it to get my email so we can discuss. – Davide Spataro Jan 26 '16 at 21:41
  • Cool. Found your email. I will write to you soon. – aatish Jan 26 '16 at 22:00
0

CUDA provides several warp-wide broadcast and reduction operations that NVIDIA’s architectures efficiently support. For example, __ballot(predicate) instruction evaluates predicate for all active threads of the warp and returns an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active [Reference: Flexible Software Profiling of GPU Architectures].