Questions tagged [cub]

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model.

CUB (CUDA UnBound) is a C++ template library of components for use on NVIDIA GPUs running CUDA.

CUB includes common data parallel operations such as prefix scan, reduction, histogram and sort. CUB's collective primitives are not bound to any particular width of parallelism or to any particular data type and can be used at device, block, warp or thread scope.
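For example, at block scope the reduction collective is specialized on the element type and the number of threads per block, and it requires a shared-memory temporary. The following is a minimal sketch of that pattern, assuming a fixed block size of 128 threads and one input element per thread (the kernel and buffer names are illustrative, not part of CUB):

#include <cub/cub.cuh>

// Block-scope sum: each block reduces 128 floats (one per thread) and
// writes one partial sum per block.
__global__ void block_sum_kernel(const float *d_in, float *d_block_sums)
{
    // Specialize the collective for float and 128 threads per block.
    typedef cub::BlockReduce<float, 128> BlockReduce;

    // Shared memory required by the collective.
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread contributes one element.
    float thread_value = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Collective reduction; the returned result is only valid in thread 0.
    float block_sum = BlockReduce(temp_storage).Sum(thread_value);

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = block_sum;
}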

It is used as a backend by other NVIDIA libraries, most prominently Thrust and RAPIDS.

CUB is developed by NVIDIA Research. Its website and documentation are hosted at https://nvlabs.github.io/cub, with the most recent source code available on GitHub. It has also been distributed with the CUDA Toolkit since at least CUDA 11.1.1 (the first version in which the CUB documentation is linked from the CUDA Toolkit documentation).
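As a complement to the block-scope sketch above, device-scope primitives such as cub::DeviceReduce use a two-call convention: the first call, made with a null temporary-storage pointer, only reports how many bytes of scratch space are needed; the second call does the actual work. A minimal sketch of this pattern (the host-side helper and variable names are illustrative):

#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Sum num_items floats from d_in into the single float pointed to by d_out.
// Both pointers are device pointers.
void device_sum(const float *d_in, float *d_out, int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call: d_temp_storage is null, so CUB only writes the required
    // scratch size into temp_storage_bytes.
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    // Allocate the scratch space and run the actual reduction.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);

    cudaFree(d_temp_storage);
}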

48 questions
0
votes
1 answer

CUB sum reduction with 2D pitched arrays

I am trying to perform a sum reduction using CUB and 2D arrays of type float/double. Although it works for certain combinations of rows and columns, for relatively large arrays I get an illegal memory access error during the last transfer. A minimal…
Aristotelis
  • 141
  • 1
  • 10
0
votes
1 answer

What is the proper way to enable cub in cupy?

I am trying to figure out the proper way to enable cub in cupy, but without success so far. I looked into the documentation and I couldn't find anything. At the moment I enable cub like this: import cupy.core._accelerator as…
AstrOne
  • 3,569
  • 7
  • 32
  • 54
0
votes
1 answer

How to compile C++ with CUB library?

I am using the CUB device function just like the example here (https://forums.developer.nvidia.com/t/cub-library/37675/2). I was able to compile the .cu source file in the above example using nvcc. However, I wonder if it is possible to call CUB…
Jane
  • 1
  • 3
0
votes
1 answer

Is there a way to use CUB::BlockScan on oddly sized data arrays?

All the examples perform scans on arrays sized by some multiple of 32. The quickest examples use 256 or more threads with 4 or more elements assigned to each thread. This means that if I had an array of size 450, then, presumably, I would have to…
jpreed00
  • 893
  • 8
  • 25
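One common way around the size restriction the question above describes is to pad out-of-range threads with the identity element before the collective call and to guard the store afterwards. A sketch of that idea (not code from the question), assuming a single block of 128 threads scanning at most 128 elements, with illustrative names:

#include <cub/cub.cuh>

// Inclusive prefix sum over n ints, where n need not be a multiple of the
// block size. Single-block, one-element-per-thread sketch.
__global__ void partial_tile_scan(const int *d_in, int *d_out, int n)
{
    typedef cub::BlockScan<int, 128> BlockScan;
    __shared__ typename BlockScan::TempStorage temp_storage;

    int idx = threadIdx.x;

    // Out-of-range threads contribute the identity (0 for a sum), so they
    // still take part in the collective without affecting the result.
    int thread_data = (idx < n) ? d_in[idx] : 0;

    BlockScan(temp_storage).InclusiveSum(thread_data, thread_data);

    // Only valid positions are written back.
    if (idx < n)
        d_out[idx] = thread_data;
}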
0
votes
1 answer

CUB sort with iterator

I would like to transform values and sort them in one go, like this: thrust::vector dataIn = ... thrust::vector dataOut = ... auto iterIn = cub::TransformInputIterator(dataIn.begin(),…
hrvthzs
  • 83
  • 9
0
votes
1 answer

dot_product with CUDA_CUB

__global__ void sum(const float * __restrict__ indata, float * __restrict__ outdata) { unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x; // --- Specialize BlockReduce for type float. typedef cub::BlockReduce
ztdep
  • 343
  • 1
  • 4
  • 17
0
votes
0 answers

Sorting shared memory array in cub

I want to use cub to sort an array in each block for me. I call the kernel with multiple blocks, each has 32 threads and each thread has an array of 27 integers. The standard sort according to CUB's GitHub page looks like this: __global__ void…
John
  • 39
  • 2
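For reference, the block-wide sorting pattern the question refers to is cub::BlockRadixSort, specialized on the key type, the thread count, and the number of keys per thread; each thread passes its keys in registers. A sketch using the dimensions mentioned above (32 threads, 27 int keys per thread; the load/store indexing is illustrative):

#include <cub/cub.cuh>

// Sort 32 * 27 int keys per block; each thread owns 27 keys in registers.
__global__ void block_sort_kernel(int *d_keys)
{
    typedef cub::BlockRadixSort<int, 32, 27> BlockRadixSort;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Load this thread's 27 keys (blocked arrangement).
    int thread_keys[27];
    int base = blockIdx.x * 32 * 27 + threadIdx.x * 27;
    for (int i = 0; i < 27; ++i)
        thread_keys[i] = d_keys[base + i];

    // Collective block-wide radix sort of all 32 * 27 keys.
    BlockRadixSort(temp_storage).Sort(thread_keys);

    // Write the sorted keys back (still in blocked arrangement).
    for (int i = 0; i < 27; ++i)
        d_keys[base + i] = thread_keys[i];
}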
0
votes
1 answer

Incorrect results with CUB ReduceByKey when specifying gencode

In one of my projects, I'm seeing some incorrect results when using CUB's DeviceReduce::ReduceByKey. However, using the same inputs/outputs with thrust::reduce_by_key produces the expected results. #include "cub/cub.cuh" #include #include…
smish
  • 3
  • 2
0
votes
1 answer

How to sort an array of CUDA vector types

Specifically how could I sort an array of float3? Such that the .x components are the primary sort criteria, the .y components are the secondary sort criteria and the .z components are the tertiary sort criteria. Is there a simple solution that can…
inJeans
  • 199
  • 1
  • 9
0
votes
1 answer

How does CUB's TexRefInputIterator work?

CUB provides an iterator for texture references, the implementation of which is readily accessible. Since I couldn't figure out how to implement template-able texture references myself - they "can only be declared as a static global variable" - I am…
Sam
  • 557
  • 6
  • 20
0
votes
1 answer

Why is my inclusive scan code 2x faster on CPU than on a GPU?

I wrote a short CUDA program that uses the highly-optimized CUB library to demonstrate that one core from an old, quad-core Intel Q6600 processor (all four are supposedly capable of ~30 GFLOPS/sec) can do an inclusive scan (or cumulative/prefix sum…
tantrev
  • 450
  • 1
  • 4
  • 11
0
votes
1 answer

Using cudaDeviceSynchronize after a CUB class

Is it necessary to call cudaDeviceSynchronize after a CUB class is invoked from a CUDA kernel? When one uses, say, DeviceReduce::Sum() from the device, there are implicit memory copies that block the device from moving on, but after experiencing…
0
votes
1 answer

Getting CUB DeviceScan to work when called from a kernel

I am currently trying to learn how to use CUB to perhaps rewrite my integrator code. I've been looking at the examples and code snippets in the docs, but I have not yet found an example of what I'm trying to do. Specifically, that is to run an…
0
votes
1 answer

cuda and cub implementation of multiple k-selection

I'm trying to implement multiple top-k selection in parallel, where each selection selects k elements from a list of n elements and there are m such tasks to be executed in parallel. I use cub to do that. I got a strange error and I don't know where I…
shaoyl85
  • 1,854
  • 18
  • 30
-1
votes
1 answer

CUB segmented reduction not producing results

I'm trying to use CUB's segmented-reduction sum primitive, and I'm stuck on it. Here is my code: int main() { const int N = 7; const int num_segments = 3; int d_offsets[]= {0,3,3,7}; int *h_data = (int *)malloc(N *…