
I have recently been running into performance issues when using the Thrust library. These come from Thrust allocating memory at the innermost level of a large nested loop structure. This is obviously unwanted; ideally execution would use a pre-allocated slab of global memory. I would like to remove or improve the offending code in one of three ways:

  1. Implementing a custom Thrust memory allocator
  2. Replacing the Thrust code with CUB code (with pre-allocated temp storage)
  3. Writing a custom kernel to do what I want

Although the third option would normally be my preferred choice, the operation I want to perform is a copy_if/select_if-type operation that returns both the data and the indexes. Writing a custom kernel would likely be reinventing the wheel, so I would prefer to go with one of the other two options.
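For reference, here is a rough sketch of the kind of Thrust operation I mean, with the indexes supplied by zipping the data with a counting_iterator (the functor, array names and threshold here are purely illustrative):

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cstdio>

// selects (value, index) pairs whose value exceeds a threshold
struct AboveThreshold
{
    int compare;

    __host__ __device__ __forceinline__
    AboveThreshold(int compare) : compare(compare) {}

    // t is the (value, index) tuple produced by the zip_iterator
    __host__ __device__ __forceinline__
    bool operator()(const thrust::tuple<int,int> &t) const {
        return (thrust::get<0>(t) > compare);
    }
};

int main()
{
    int h_in[] = {5, 4, 3, 2, 1, 0};
    thrust::device_vector<int> d_in(h_in, h_in + 6);
    thrust::device_vector<int> d_vals(d_in.size());  // selected values
    thrust::device_vector<int> d_idxs(d_in.size());  // indexes of the selected values

    typedef thrust::device_vector<int>::iterator IntIter;
    typedef thrust::counting_iterator<int>       CntIter;
    typedef thrust::zip_iterator<thrust::tuple<IntIter, CntIter> > ZipIn;
    typedef thrust::zip_iterator<thrust::tuple<IntIter, IntIter> > ZipOut;

    // zip each value with its index, then compact both streams together
    ZipIn first = thrust::make_zip_iterator(
        thrust::make_tuple(d_in.begin(), thrust::make_counting_iterator(0)));
    ZipOut out = thrust::make_zip_iterator(
        thrust::make_tuple(d_vals.begin(), d_idxs.begin()));

    int n = d_in.size();
    ZipOut end = thrust::copy_if(first, first + n, out, AboveThreshold(2));
    int num_selected = end - out;

    for (int i = 0; i < num_selected; i++)
        printf("index: %d, value: %d\n", (int)d_idxs[i], (int)d_vals[i]);
    return 0;
}

The problem is that copy_if allocates its temporary storage internally on every call, which is exactly the behaviour I want to avoid inside the loop.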

I have been hearing great things about CUB, and so I see this as an ideal chance to use it in anger. What I would like to know is:

How would one implement a CUB select_if with returned indexes?

Can this be done with an ArgIndexInputIterator and a functor like so?

struct GreaterThan
{
    int compare;

    __host__ __device__ __forceinline__
    GreaterThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const cub::ArgIndexInputIterator<int> &a) const {
        return (a.value > compare);
    }
};

with the following in the main body of the code:

//d_in = device int array
//d_temp_storage = some preallocated block


int threshold_value;
GreaterThan select_op(threshold_value);

cub::ArgIndexInputIterator<int> input_itr(d_in);
cub::ArgIndexInputIterator<int> output_itr(d_out); //????


CubDebugExit(DeviceSelect::If(d_temp_storage, temp_storage_bytes, input_itr, output_itr, d_num_selected, num_items, select_op));

Will this try to do any memory allocation under the hood?

EDIT:

So, going off Robert Crovella's comment, the functor should take the result of dereferencing a cub::ArgIndexInputIterator<int>, which should be a cub::ItemOffsetPair<int,int>, making the functor:

struct GreaterThan
{
    int compare;

    __host__ __device__ __forceinline__
    GreaterThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const cub::ItemOffsetPair<int,int> &a) const {
        return (a.value > compare);
    }
};

and in the code, d_out should be a device array of cub::ItemOffsetPair<int,int>:

//d_in = device int array
//d_temp_storage = some preallocated block

cub::ItemOffsetPair<int,int> * d_out;
//allocate d_out

int threshold_value;
GreaterThan select_op(threshold_value);

cub::ArgIndexInputIterator<int,int> input_itr(d_in);
CubDebugExit(DeviceSelect::If(d_temp_storage, temp_storage_bytes, input_itr, d_out, d_num_selected, num_items, select_op));
ebarr
  • The functor operator should take as its parameter the item returned by *dereferencing* the input iterator -- not the iterator itself. The item returned by dereferencing your input iterator is a `ItemOffsetPair`. You'll need to use that appropriately in your comparison functor. For the output, you will need an iterator which points to the type returned by dereferencing the input iterator - i.e. an array of `ItemOffsetPair`. You cannot just declare an `ArgIndexInputIterator` from an `int` pointer `d_out`. And it should not do any allocation beyond what is required in `d_temp_storage`. – Robert Crovella Mar 19 '14 at 03:47
  • Thanks, I have added in an edit with what I understood from your comment. Is this more or less along the right lines? – ebarr Mar 19 '14 at 04:07
  • I think you should define it as `cub::ItemOffsetPair * d_out;` but I may be wrong. I'm not trying to be coy, but at the moment I'm away from a machine where I can write some code to test something that will work correctly, which is why I provided a comment rather than an answer. And you would probably want to initialize `threshold_value` before the `GreaterThan` functor instantiation. Other than that, try it and see what kind of compile errors you get. I may have missed something. – Robert Crovella Mar 19 '14 at 04:39
  • No worries, any info is good info. So from what I understand from the CUB examples, it is not necessary to set the offset pointer type for `cub::ArgIndexInputIterator` despite the fact that it is templated as `cub::ArgIndexInputIterator< InputIterator, Offset >`. I assume that there is a default `size_t` or `int` for the offset type. I was assuming this would also be the case for `cub::ItemOffsetPair`. However, as explicit is almost always better than implicit, I'll update the question. – ebarr Mar 19 '14 at 04:45
  • I think you need to change one of the uses of `compare` in your functor to `_compare` or something different, anyway. – Robert Crovella Mar 19 '14 at 04:46
  • C++ should be fine with that initialisation list. The first time I saw the same argument names as class/struct attribute names I baulked, but it is completely passable code and I find it makes things more readable. – ebarr Mar 19 '14 at 04:49
  • OK on the functor initialization. Regarding the input/output item types, I'm not sure. Anyway, the item type returned by dereferencing `input_itr` should match the item type to be stored using `output_itr`. I think you can figure that out from compiler feedback. – Robert Crovella Mar 19 '14 at 04:51
  • `output_itr` was a typo, it should have been `d_out` now that its type has been redeclared. – ebarr Mar 19 '14 at 05:39

2 Answers


I have recently been running into performance issues when using the Thrust library. These come from thrust allocating memory in the base of a large nested loop structure. This is obviously unwanted, with ideal execution using a pre-allocated slab of global memory.

Thrust lets you customize how temporary memory is allocated during algorithm execution.

See the custom_temporary_allocation example to see how to create a cache for your pre-allocated slab.
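For illustration, here is a minimal sketch of such a caching allocator along the lines of that example (this assumes a Thrust version where an allocator can be attached to the execution policy via thrust::cuda::par(alloc); the cache below only reuses blocks of exactly the requested size, and the names are illustrative):

#include <thrust/system/cuda/memory.h>
#include <thrust/system/cuda/execution_policy.h>
#include <map>
#include <cstddef>

struct cached_allocator
{
  typedef char value_type;

  std::multimap<std::ptrdiff_t, char *> free_blocks;      // cached blocks, keyed by size
  std::map<char *, std::ptrdiff_t>      allocated_blocks; // live blocks, keyed by pointer

  char *allocate(std::ptrdiff_t num_bytes)
  {
    char *result = 0;

    // reuse a cached block of the requested size if one exists
    std::multimap<std::ptrdiff_t, char *>::iterator hit = free_blocks.find(num_bytes);
    if (hit != free_blocks.end())
    {
      result = hit->second;
      free_blocks.erase(hit);
    }
    else
    {
      // no cached block of this size: fall back to a real device allocation
      result = thrust::cuda::malloc<char>(num_bytes).get();
    }

    allocated_blocks.insert(std::make_pair(result, num_bytes));
    return result;
  }

  void deallocate(char *ptr, size_t)
  {
    // return the block to the cache rather than freeing it
    std::map<char *, std::ptrdiff_t>::iterator it = allocated_blocks.find(ptr);
    std::ptrdiff_t num_bytes = it->second;
    allocated_blocks.erase(it);
    free_blocks.insert(std::make_pair(num_bytes, ptr));
  }

  void free_all()
  {
    // release everything held in the cache (call once at shutdown)
    for (std::multimap<std::ptrdiff_t, char *>::iterator it = free_blocks.begin();
         it != free_blocks.end(); ++it)
      thrust::cuda::free(thrust::cuda::pointer<char>(it->second));
    free_blocks.clear();
  }
};

The allocator is then passed to each algorithm call through the execution policy, e.g. thrust::copy_if(thrust::cuda::par(alloc), first, last, out, pred), so temporary storage inside the loop is drawn from the cache instead of fresh cudaMalloc/cudaFree calls.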

Jared Hoberock
  • I have already looked into doing this, and it seems quite straightforward; my question is more about how to solve the problem with CUB. The choice of whether I stick with Thrust or move to CUB will depend on which is faster when optimised for memory efficiency. – ebarr Mar 19 '14 at 05:44
  • I can't really accept this as the answer, as it didn't answer the original question, but I will +1 it, as the cached_allocator model is beautifully neat and is extremely simple to implement. Thank you for pointing me towards that. – ebarr Mar 20 '14 at 09:35

After some fiddling and asking around, I was able to get a simple code along the lines of what you suggest working:

$ cat t348.cu
#include <cub/cub.cuh>
#include <stdio.h>
#define DSIZE 6

struct GreaterThan
{

    __host__ __device__ __forceinline__
    bool operator()(const cub::ItemOffsetPair<int, ptrdiff_t> &a) const {
        return (a.value > DSIZE/2);
    }
};

int main(){

  int num_items = DSIZE;
  int *d_in;
  cub::ItemOffsetPair<int,ptrdiff_t> * d_out;
  int *d_num_selected;
  int *d_temp_storage = NULL;
  size_t temp_storage_bytes = 0;

  cudaMalloc((void **)&d_in, num_items*sizeof(int));
  cudaMalloc((void **)&d_num_selected, sizeof(int));
  cudaMalloc((void **)&d_out, num_items*sizeof(cub::ItemOffsetPair<int,ptrdiff_t>));

  int h_in[DSIZE] = {5, 4, 3, 2, 1, 0};
  cudaMemcpy(d_in, h_in, num_items*sizeof(int), cudaMemcpyHostToDevice);

  cub::ArgIndexInputIterator<int *> input_itr(d_in);


  // first call with d_temp_storage == NULL only computes temp_storage_bytes
  cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, input_itr, d_out, d_num_selected, num_items, GreaterThan());

  cudaMalloc(&d_temp_storage, temp_storage_bytes);

  // second call performs the selection, writing (value, offset) pairs to d_out
  cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, input_itr, d_out, d_num_selected, num_items, GreaterThan());
  int h_num_selected = 0;
  cudaMemcpy(&h_num_selected, d_num_selected, sizeof(int), cudaMemcpyDeviceToHost);
  cub::ItemOffsetPair<int, ptrdiff_t> h_out[h_num_selected];
  cudaMemcpy(h_out, d_out, h_num_selected*sizeof(cub::ItemOffsetPair<int, ptrdiff_t>), cudaMemcpyDeviceToHost);
  for (int i =0 ; i < h_num_selected; i++)
    printf("index: %d, offset: %d, value: %d\n", i, h_out[i].offset, h_out[i].value);

  return 0;
}
$ nvcc -arch=sm_20 -o t348 t348.cu
$ ./t348
index: 0, offset: 0, value: 5
index: 1, offset: 1, value: 4
$

RHEL 6.2, cub v1.2.2, CUDA 5.5

Robert Crovella
  • Fantastic, I was close, but not close enough. In the meantime, thanks to the advice of Jared Hoberock, I have implemented an optimised Thrust version to play with and compare against. Let the benchmarking begin! – ebarr Mar 20 '14 at 21:31
  • `cub::ItemOffsetPair` no longer seems to be a part of the cub library (v1.8.0), but the code above should work if it is replaced with `cub::KeyValuePair`, with the order of the template arguments flipped (e.g. `cub::ItemOffsetPair<int, ptrdiff_t>` becomes `cub::KeyValuePair<ptrdiff_t, int>`). Still a useful answer 6 years later.. :) – zanbri Aug 04 '20 at 14:32
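Following up on that comment, here is a sketch of how the relevant pieces of the listing above might change against a newer CUB (this assumes the ArgIndexInputIterator now dereferences to a cub::KeyValuePair, with the offset held in .key and the item in .value):

// functor: the argument type changes from ItemOffsetPair to KeyValuePair
struct GreaterThan
{
    __host__ __device__ __forceinline__
    bool operator()(const cub::KeyValuePair<ptrdiff_t, int> &a) const {
        return (a.value > DSIZE/2);
    }
};

// output array: the same template-argument flip
cub::KeyValuePair<ptrdiff_t, int> *d_out;

// when printing, the offset is read from .key rather than .offset:
//   printf("index: %d, offset: %d, value: %d\n", i, (int)h_out[i].key, h_out[i].value);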