I am optimising a pycuda / thrust program. In it, I use thrust::min_element to identify the index of the minimum element in an array that is on the device.
Using Nvidia's visual profiler, it appears that whenever I call thrust::min_element
, there is a DtoH (device to host) memcpy. What I would like is for everything to be conducted only on the device. In other words, the output of min_element() should be stored on the device, where I can use it later, without suffering the cost of the small DtoH memcpy. Is there a way to do this? Or am I thinking about things the wrong way?
My attempt to do this is below, where the idea is to place the index of the smallest element in the array pointed at by input_ptr
into the first element of the array pointed to by output_ptr
. Everything should be done on the device, nothing on the host.
This code produces the right answer, but involving unwanted memcpys. Many thanks in advance for any help you can provide.
#include <thrust/extrema.h>
#include <thrust/device_vector.h>
#include <cuda.h>
void my_min_element(CUdeviceptr input_ptr, int length, CUdeviceptr output_ptr)
{
thrust::device_ptr<float> i_ptr((float*)input_ptr);
thrust::device_ptr<int> o_ptr((int*)output_ptr);
o_ptr[0] = thrust::distance(i_ptr,thrust::min_element(i_ptr, i_ptr+length));
}