How to use thrust min_element algorithm without memcpys between device and host

Question

I am optimising a pycuda / thrust program. In it, I use thrust::min_element to identify the index of the minimum element in an array that is on the device.

Using Nvidia's visual profiler, it appears that whenever I call thrust::min_element, there is a DtoH (device to host) memcpy. What I would like is for everything to be conducted only on the device. In other words, the output of min_element() should be stored on the device, where I can use it later, without suffering the cost of the small DtoH memcpy. Is there a way to do this? Or am I thinking about things the wrong way?

My attempt to do this is below, where the idea is to place the index of the smallest element in the array pointed at by input_ptr into the first element of the array pointed to by output_ptr. Everything should be done on the device, nothing on the host.

This code produces the right answer, but involving unwanted memcpys. Many thanks in advance for any help you can provide.

#include <thrust/extrema.h>
#include <thrust/device_vector.h>
#include <cuda.h>

void my_min_element(CUdeviceptr input_ptr, int length, CUdeviceptr output_ptr)
{
  thrust::device_ptr<float> i_ptr((float*)input_ptr);
  thrust::device_ptr<int> o_ptr((int*)output_ptr);
  o_ptr[0] = thrust::distance(i_ptr,thrust::min_element(i_ptr, i_ptr+length));
}

It seems you've answered your own question. Also, the standalone line of code: `thrust::min_element(i_ptr, i_ptr+length);` is redundant. It's not serving any purpose. — Robert Crovella, Jan 30 '14 at 21:01
Ah yes, oops. That was left over from a debugging session. I will remove it from above. — weemattisnot, Jan 30 '14 at 21:58
@weemattisnot Could you please post your edit as an answer so to remove this question from the unanswered list? — Vitality, Feb 26 '14 at 11:56

score 2 · Accepted Answer · answered Feb 26 '14 at 14:55

I have found a (disappointing) answer to my own question:

I found this quote from someone on the CUDA development team [link]

"I am not a Thrust expert, so take this feedback with a grain of salt; but I think this design element of Thrust deserves to be revisited. Thrust is expressive and useful in ways that sometimes are undermined by the emphasis on returning results to the host. I've had plenty of occasions where I wanted to do an operation strictly in device memory, so Thrust's predisposition toward returning a value to host memory actually got in the way; and if I want results returned to the host, I can always pass in a mapped device pointer (which, if UVA is in effect, means any host pointer that was allocated by CUDA)"

..so it looks like I may be out of luck. If so, what a design flaw in thrust!

score -1 · Answer 2 · answered Jun 07 '14 at 19:10

-1

Im not sure if you are still interested in this, but I believe I have done what you wanted it just casting the CUdeviceptr variable. (And telling thrust to use the device) Here it is with a reduction, and I believe thrust doesnt make any extra copies :)

extern int GPUReduceCudaManage(CUdeviceptr d_data, unsigned int numElements)
{

 thrust::plus<int> binary_op_plus;

 int result = thrust::reduce(thrust::device,
                (int*) d_data,
                (int*) d_data + numElements,
                 0,
                 binary_op_plus);


return result;
}

answered Jun 07 '14 at 19:10

gmm

943
1
17
30

1

No, this is not correct. Thrust makes an implicit device->host memcpy to populate the `result` variable. Clearly your `result` variable is in host code and usable in host code. Clearly the reduction is occuring on the device. Therefore a device->host memcpy *must* be occurring. You won't be able to avoid this by "clever" casting. – Robert Crovella Jun 10 '14 at 15:06
Ok, yeah, in this example where the result is a variable I assumed it would do that copy. But in a exclusive_scan, for example, that all the variables that I give are on the device (inputs and output), would it make a copy as well? That wouldnt make sense, to which host variable, if there is none? And I have tested that this case works, by copying the result out with a cudaMemcpy – gmm Jun 10 '14 at 18:29
Well, I needed to cast them because thrust wont accept CUdeviceptr as an argument, and I only had available (as I thought the question had) the pointer to the device memory. – gmm Jun 10 '14 at 20:12
What do you suppose this question is about? Did you read the question? The OP desires that "the output of min_element() should be stored on the device". Does your answer address that in any way? – Robert Crovella Jun 10 '14 at 20:27

How to use thrust min_element algorithm without memcpys between device and host

2 Answers2