I have seen reduction algorithms in CUDA (such as summation and maximization over a range of elements) discussed in previous posts, but with dynamic parallelism they could potentially be implemented in a different way. Is there a more efficient implementation that is callable from inside kernels?
Is there a good implementation of a reduction algorithm callable from a kernel with dynamic parallelism?
1 Answer
Is there a more efficient implementation which is callable from inside the kernels?
CUB provides a CUDA reduction primitive that is compatible with dynamic parallelism, that is, one that can be called from within kernels.
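As a rough illustration of calling a CUB reduction from inside a kernel, here is a minimal sketch using the block-wide collective `cub::BlockReduce`. This is not necessarily the primitive meant above: it reduces within one thread block and needs no dynamic parallelism at all. The kernel name `block_sum_kernel`, the constant `BLOCK_THREADS`, and the host helper `sum_with_block_reduce` are hypothetical names chosen for this example, and error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Assumption for this sketch: a fixed block size known at compile time.
constexpr int BLOCK_THREADS = 128;

// Each block sums up to BLOCK_THREADS consecutive ints and writes one partial sum.
__global__ void block_sum_kernel(const int* d_in, int* d_block_sums, int num_items)
{
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;

    // Shared scratch space required by the collective.
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    int thread_value = (idx < num_items) ? d_in[idx] : 0;  // pad the tail with zeros

    // Every thread of the block must take part; the result is valid in thread 0 only.
    int block_total = BlockReduce(temp_storage).Sum(thread_value);

    if (threadIdx.x == 0) {
        d_block_sums[blockIdx.x] = block_total;
    }
}

// Hypothetical host helper: launches the kernel and adds the per-block sums on the CPU.
int sum_with_block_reduce(const std::vector<int>& h_in)
{
    int num_items = static_cast<int>(h_in.size());
    if (num_items == 0) return 0;

    int num_blocks = (num_items + BLOCK_THREADS - 1) / BLOCK_THREADS;

    int* d_in = nullptr;
    int* d_block_sums = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_block_sums, num_blocks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);

    block_sum_kernel<<<num_blocks, BLOCK_THREADS>>>(d_in, d_block_sums, num_items);

    std::vector<int> h_block_sums(num_blocks);
    cudaMemcpy(h_block_sums.data(), d_block_sums, num_blocks * sizeof(int),
               cudaMemcpyDeviceToHost);  // implicitly waits for the kernel to finish

    cudaFree(d_in);
    cudaFree(d_block_sums);

    int total = 0;
    for (int s : h_block_sums) total += s;
    return total;
}
```

Replacing `.Sum(thread_value)` with `.Reduce(thread_value, cub::Max())` should give a block-wide maximum instead, provided the out-of-range padding value is changed from 0 to something that cannot win the comparison (for example `INT_MIN`).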

Vitality
- Wonderful! That's exactly what I'm looking for! – shaoyl85 Jan 14 '14 at 02:33
- Do you also know of any library that can perform many k-selections in parallel, for example 1,000,000 k-selections at once, each finding the k-th largest element among around 10,000 elements? – shaoyl85 Jan 14 '14 at 02:35