Is there a good implementation of a reduction algorithm callable from a kernel with dynamic parallelism?

Question

I see reduction algorithms in CUDA (such as summation and maximization over a range of elements) discussed in previous posts, but with dynamic parallelism they could potentially be implemented differently. Is there a more efficient implementation that is callable from inside kernels?

score 1 · Accepted Answer · answered Jan 12 '14 at 21:16

Is there a more efficient implementation which is callable from inside the kernels?

CUB provides a CUDA reduction primitive that is compatible with dynamic parallelism, that is, one that can be called from within kernels.
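As a rough illustration, here is a minimal sketch of calling `cub::DeviceReduce::Sum` from inside a kernel. It is not taken from the CUB documentation. It assumes a GPU and toolkit that support dynamic parallelism, a CUB version whose device-wide algorithms are callable from device code, and a build with relocatable device code (something like `nvcc -rdc=true -lcudadevrt`). The names `parent_kernel` and `sum_via_child_launch` are hypothetical, error checking is omitted, and the temporary storage is sized on the host on the assumption that the same size is sufficient for the device-side call.

```cuda
// Assumed build line: nvcc -rdc=true -lcudadevrt reduce_cdp.cu -o reduce_cdp
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical parent kernel: one thread asks CUB to run a device-wide sum.
// CUB launches its own child grids from here (dynamic parallelism).
__global__ void parent_kernel(void* d_temp, size_t temp_bytes,
                              const int* d_in, int* d_out, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    }
}

// Hypothetical host helper: sums a non-empty vector through parent_kernel.
int sum_via_child_launch(const std::vector<int>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Size query: with a null temp pointer, CUB only writes temp_bytes.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    parent_kernel<<<1, 32>>>(d_temp, temp_bytes, d_in, d_out, n);
    // The parent grid is complete only once its child grids have finished,
    // so a host-side synchronize is enough before reading the result.
    cudaDeviceSynchronize();

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_temp);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> data(1000, 2);  // 1000 elements, each equal to 2
    std::printf("sum = %d\n", sum_via_child_launch(data));
    return 0;
}
```

The result is read on the host here because the parent kernel does not wait for the child grids itself. If only a per-block reduction is needed, `cub::BlockReduce` can be used inside a kernel without any child launch.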

Vitality

Wonderful! That's exactly what I'm looking for! – shaoyl85 Jan 14 '14 at 02:33
Do you also know of any library that can compute many k-selections in parallel? For example, 1,000,000 k-selections at once, each finding the k-th largest element among around 10,000 elements. – shaoyl85 Jan 14 '14 at 02:35
