I have seen reduction algorithms in CUDA (such as summation and maximization over a range of elements) discussed in previous posts, but with dynamic parallelism they could potentially be implemented in a different way. Is there a more efficient implementation that is callable from inside kernels?

shaoyl85

1 Answer

Is there a more efficient implementation which is callable from inside the kernels?

CUB provides a CUDA reduction primitive that is compatible with dynamic parallelism, that is, it can be called from within kernels.
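As a rough illustration of calling a CUB reduction from inside a kernel, here is a minimal sketch. Note that it uses CUB's block-level primitive `cub::BlockReduce`, not the device-wide entry point that dynamic parallelism would launch, so it is a related technique and not necessarily the exact one meant above. The kernel name `block_sum_kernel`, the constant `BLOCK_THREADS`, and the all-ones test data are assumptions made for this example.

```cuda
#include <cstdio>
#include <cub/cub.cuh>

// Hypothetical block size chosen for this sketch.
constexpr int BLOCK_THREADS = 128;

// Each block reduces its slice of d_in with cub::BlockReduce,
// then thread 0 adds the block's partial sum to *d_out.
__global__ void block_sum_kernel(const int* d_in, int* d_out, int n)
{
    using BlockReduce = cub::BlockReduce<int, BLOCK_THREADS>;
    // Shared scratch space required by the block-wide reduction.
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * BLOCK_THREADS + threadIdx.x;
    // Out-of-range threads contribute the identity element for addition.
    int value = (idx < n) ? d_in[idx] : 0;

    // All threads of the block must call Sum(); the result is only
    // valid in thread 0.
    int block_sum = BlockReduce(temp_storage).Sum(value);

    if (threadIdx.x == 0) {
        atomicAdd(d_out, block_sum);
    }
}

int main()
{
    const int n = 1000;
    int h_in[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1;  // illustrative input: all ones

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    int blocks = (n + BLOCK_THREADS - 1) / BLOCK_THREADS;
    block_sum_kernel<<<blocks, BLOCK_THREADS>>>(d_in, d_out, n);

    int h_out = 0;
    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return (h_out == n) ? 0 : 1;
}
```

The block-level form needs no child kernel launch, which is why it is used here. Calling the device-wide `cub::DeviceReduce` from device code instead relies on dynamic parallelism, which requires a device of compute capability 3.5 or higher and compiling with relocatable device code (`-rdc=true`, linking `cudadevrt`).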

Vitality
  • Wonderful! That's exactly what I'm looking for! – shaoyl85 Jan 14 '14 at 02:33
  • Do you also know of any library that can compute many k-selections in parallel? For example, 1,000,000 k-selections at once, each finding the k-th largest element among roughly 10,000 elements. – shaoyl85 Jan 14 '14 at 02:35