I have recently been running into performance issues when using the Thrust library. These come from Thrust allocating memory at the innermost level of a large nested loop structure. This is obviously unwanted; ideally, execution would use a pre-allocated slab of global memory. I would like to remove or improve the offending code in one of three ways:
- Implementing a custom Thrust memory allocator
- Replacing the Thrust code with CUB code (with pre-allocated temp storage)
- Writing a custom kernel to do what I want
Although the third option would normally be my preferred choice, the operation that I want to perform is a copy_if/select_if type operation where both the data and indexes are returned. Writing a custom kernel would likely be reinventing the wheel, so I would prefer to go with one of the other two options.
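To make the target concrete, the result I am after can be sketched on the host as below. This is only a sequential reference for checking GPU output, not Thrust or CUB code; the names IndexedValue and select_greater_than are hypothetical and chosen purely for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side record: a selected value plus the position
// it had in the input array.
struct IndexedValue {
    std::size_t index;
    int value;
};

// Sequential reference: keep every element strictly greater than
// `threshold`, remembering where it came from. Input order is preserved.
std::vector<IndexedValue> select_greater_than(const std::vector<int>& in,
                                              int threshold) {
    std::vector<IndexedValue> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] > threshold) {
            out.push_back({i, in[i]});
        }
    }
    return out;
}
```

Whatever GPU implementation I end up with should produce the same pairs, in the same order, as this reference.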
I have been hearing great things about CUB, and so I see this as an ideal chance to use it in anger. What I would like to know is:
How would one implement a CUB select_if with returned indexes?
Can this be done with an ArgIndexInputIterator and a functor like so?
struct GreaterThan
{
    int compare;

    __host__ __device__ __forceinline__
    GreaterThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const cub::ArgIndexInputIterator<int> &a) const {
        return (a.value > compare);
    }
};
with the following in the main body of the code:
//d_in = device int array
//d_temp_storage = some preallocated block
int threshold_value;
GreaterThan select_op(threshold_value);
cub::ArgIndexInputIterator<int> input_itr(d_in);
cub::ArgIndexInputIterator<int> output_itr(d_out); //????
CubDebugExit(DeviceSelect::If(d_temp_storage, temp_storage_bytes, input_itr, output_itr, d_num_selected, num_items, select_op));
Will this try to do any memory allocation under the hood?
EDIT:
So going off Robert Crovella's comment, the functor should take the product of dereferencing a cub::ArgIndexInputIterator<int>, which should be a cub::ItemOffsetPair<int,int>, making the functor now:
struct GreaterThan
{
    int compare;

    __host__ __device__ __forceinline__
    GreaterThan(int compare) : compare(compare) {}

    __host__ __device__ __forceinline__
    bool operator()(const cub::ItemOffsetPair<int,int> &a) const {
        return (a.value > compare);
    }
};
and in the code, d_out should be a device array of cub::ItemOffsetPair<int,int>:
//d_in = device int array
//d_temp_storage = some preallocated block
cub::ItemOffsetPair<int,int> * d_out;
//allocate d_out
int threshold_value;
GreaterThan select_op(threshold_value);
// first template argument is the wrapped iterator type (int*), second is the offset type
cub::ArgIndexInputIterator<int *, int> input_itr(d_in);
CubDebugExit(DeviceSelect::If(d_temp_storage, temp_storage_bytes, input_itr, d_out, d_num_selected, num_items, select_op));
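The point from the comment, that the predicate receives what the iterator dereferences to rather than the iterator itself, can be illustrated with a host-only analogue. This is a minimal sketch using std::copy_if, not CUB: OffsetValue, ArgIndexIter, GreaterThanHost and select_pairs are hypothetical stand-ins I made up to mirror the roles of the pair type, the arg-index iterator and the functor above.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical host-side analogue of an (offset, value) pair.
struct OffsetValue {
    std::ptrdiff_t offset;
    int value;
};

// Hypothetical minimal input iterator: dereferencing yields the element
// together with its offset from the start of the array.
class ArgIndexIter {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = OffsetValue;
    using difference_type = std::ptrdiff_t;
    using pointer = const OffsetValue*;
    using reference = OffsetValue;

    ArgIndexIter(const int* base, std::ptrdiff_t offset)
        : base_(base), offset_(offset) {}

    OffsetValue operator*() const { return {offset_, base_[offset_]}; }
    ArgIndexIter& operator++() { ++offset_; return *this; }
    ArgIndexIter operator++(int) { ArgIndexIter tmp = *this; ++offset_; return tmp; }
    bool operator==(const ArgIndexIter& o) const { return offset_ == o.offset_; }
    bool operator!=(const ArgIndexIter& o) const { return offset_ != o.offset_; }

private:
    const int* base_;
    std::ptrdiff_t offset_;
};

// The predicate takes the dereferenced pair, not the iterator.
struct GreaterThanHost {
    int compare;
    explicit GreaterThanHost(int c) : compare(c) {}
    bool operator()(const OffsetValue& a) const { return a.value > compare; }
};

// Select (offset, value) pairs whose value exceeds `threshold`.
std::vector<OffsetValue> select_pairs(const std::vector<int>& in, int threshold) {
    std::vector<OffsetValue> out;
    ArgIndexIter first(in.data(), 0);
    ArgIndexIter last(in.data(), static_cast<std::ptrdiff_t>(in.size()));
    std::copy_if(first, last, std::back_inserter(out), GreaterThanHost(threshold));
    return out;
}
```

The output element type matches the iterator's value type, which is the same reason d_out above has to be an array of pairs rather than an array of int.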