
I am reading the CUB documentation and examples:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_radix_sort.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;

    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...

    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}

In the example, each thread has 4 keys. It looks like `thread_keys` will be allocated in local memory (which resides off-chip in device memory). If I only have 1 key per thread, could I declare `int thread_key;` and keep this variable in a register only?

`BlockRadixSort(temp_storage).Sort()` takes a pointer to the keys as a parameter. Does that mean the keys have to be in global memory?

I would like to use this code, but I want each thread to hold one key in a register and keep it on-chip in registers/shared memory after the keys are sorted. Thanks in advance!

yidiyidawu

1 Answer


You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without deconstructing the `BlockRadixSort` object.

Here's example code that uses shared memory to hold the initial data to be sorted and the final sorted results. This sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:

#include <cub/cub.cuh>
#include <stdio.h>
#define nTPB 32
#define ELEMS_PER_THREAD 1

// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];
    using namespace cub;
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;

    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD]  = (threadIdx.x + 5)%nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);

    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();

    printf("thread %d sorted data = %d\n", threadIdx.x,  my_val[threadIdx.x*ELEMS_PER_THREAD]);
}

int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();

}

This seems to work correctly for me; in this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
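For reference, since CUB is header-only, building should need nothing beyond pointing `nvcc` at the CUB headers; a command along these lines should work (the include path and file name here are placeholders):

nvcc -I/path/to/cub -o blocksort blocksort.cu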

The strange/ugly static casting is needed as far as I can tell, because the CUB `Sort` expects a reference to an array whose length equals the customization parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD):

   __device__ __forceinline__ void Sort(
        Key     (&keys)[ITEMS_PER_THREAD],          
        int     begin_bit   = 0,                   
        int     end_bit     = sizeof(Key) * 8)      
   { ...
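If the two chained `static_cast`s bother you, a single `reinterpret_cast` to an array reference expresses the same thing. This is just a sketch of an equivalent way to write the call in the kernel above (the name `my_keys` is arbitrary), not something the CUB documentation prescribes:

// reinterpret this thread's segment of shared memory as a
// reference to an int[ELEMS_PER_THREAD], which is what Sort() expects
int (&my_keys)[ELEMS_PER_THREAD] =
    reinterpret_cast<int(&)[ELEMS_PER_THREAD]>(my_val[threadIdx.x*ELEMS_PER_THREAD]);
my_block_sort(sort_temp_stg).Sort(my_keys);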
Robert Crovella
  • What happens if I have a 1024-thread block, but tell BlockRadixSort that TPB is 512? Will it only use the first 512 threads to sort the data? – yidiyidawu Mar 02 '14 at 11:11
  • That's not how this code works. That would not be one key per thread, which was what your question indicated. The second customization parameter for `BlockRadixSort` is "BLOCK_THREADS" which is the number of threads per block. – Robert Crovella Mar 02 '14 at 12:43
  • Btw, just a quick question: do we have to use the two `__syncthreads()` here? It looks like the Sort function only takes as input the values prepared by the thread itself. – shaoyl85 Mar 04 '14 at 00:37
  • My thinking was: sorting is basically swapping values all over the place. So yes, a given thread may "take in" its own input values, but as soon as the swapping starts, *all* of the data had better be valid. Therefore, the first sync seems necessary, so that all of the synthetic data gets populated and stored in shared mem before any of the swapping begins. Now, I'm treating the BlockRadixSort as opaque. If I look inside it and discover that the first thing it does is a sync, then obviously my first sync is redundant. The second sync follows a similar logic to protect subsequent code. – Robert Crovella Mar 04 '14 at 04:55
  • Also, as another hint, you'll note that the [sample code from which my answer is derived](http://nvlabs.github.io/cub/index.html) has the BlockRadixSort bracketed by `__syncthreads()`. This suggests to me that there are no internal redundant `__syncthreads()`, and that they are in fact necessary to make sure 1. the data is populated before the swapping begins, and 2. the swapping is done before the results are consumed. – Robert Crovella Mar 04 '14 at 05:03