How to perform basic operations (+ - * /) on GPU and store the result on it

Question

I have the following line of code. gamma is a CPU variable that I will later need to copy to the GPU; gamma_x and delta are also stored on the CPU. Is there any way to execute this line and store its result directly on the GPU? In other words, I want to keep gamma, gamma_x and delta on the GPU and get the output of the line below on the GPU as well. That would speed up the code that follows considerably. I tried magma_dcopy, but so far I couldn't get it to work because magma_ddot returns a double on the CPU.

gamma = -(gamma_x[i+1] + magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue))/delta;

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

The very short answer is no, you can't do this, or at least not if you use magma_ddot.

However, magma_ddot is itself only a very thin wrapper around cublasDdot, and the cublas function fully supports having the result of the operation stored in GPU memory rather than returned to the host.

In theory you could do something like this:

// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));

for (int i=....) { 
    // ...

    // magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
    // dotresult is already a device pointer, so pass it directly (not its address)
    cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, dotresult);
    cudaDeviceSynchronize();
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);

    // Now dotresult holds the magma_ddot result in device memory

    // ...

}

Note that this might make Magma blow up depending on how you are using it, because Magma uses CUBLAS internally, and how CUBLAS state and asynchronous operations are handled inside Magma is completely undocumented. Having said that, if you are careful, it should be OK.
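To try the device pointer mode in isolation from Magma, a minimal sketch using plain cuBLAS might look like the following. The helper name device_dot_demo is hypothetical, error checking is omitted for brevity, and the result is copied back to the host only so that it can be inspected.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not part of Magma or cuBLAS): computes dot(x, y) with
// cublasDdot writing its result to device memory. Assumes x and y have the
// same length. Error checking is omitted to keep the sketch short.
double device_dot_demo(const std::vector<double>& x, const std::vector<double>& y)
{
    const int n = static_cast<int>(x.size());
    double *d_x = nullptr, *d_y = nullptr, *d_result = nullptr;

    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cudaMalloc(&d_result, sizeof(double));   // the scalar result lives on the GPU
    cudaMemcpy(d_x, x.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // In device pointer mode the last argument must be a device pointer.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasDdot(handle, n, d_x, 1, d_y, 1, d_result);
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);

    // Copy back for inspection only; real code would keep using d_result on the GPU.
    double h_result = 0.0;
    cudaMemcpy(&h_result, d_result, sizeof(double), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_result);
    return h_result;
}
```

Here the handle is a private one, so switching its pointer mode cannot disturb any state Magma relies on. The blocking cudaMemcpy at the end also serves as the synchronization point, since the handle runs on the default stream.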

To then execute your calculation, either write a very simple kernel and launch it with one thread, or perhaps use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise to the reader.
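As a rough illustration of the single-thread kernel option, the sketch below computes gamma = -(gamma_x[i+1] + dot) / delta entirely from device memory. The names compute_gamma and gamma_demo are hypothetical, and the host helper exists only to exercise the kernel on known values; error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Single-thread kernel: every operand and the result are in device memory.
// gamma_x_next points at gamma_x[i+1], dotresult at the cublasDdot output.
__global__ void compute_gamma(const double* gamma_x_next,
                              const double* dotresult,
                              const double* delta,
                              double* gamma)
{
    *gamma = -(*gamma_x_next + *dotresult) / *delta;
}

// Hypothetical host helper used only to check the kernel on known inputs.
double gamma_demo(double gamma_x_next, double dot, double delta)
{
    const double h_in[3] = {gamma_x_next, dot, delta};
    double* d_buf = nullptr;                      // 3 inputs followed by 1 output
    cudaMalloc(&d_buf, 4 * sizeof(double));
    cudaMemcpy(d_buf, h_in, 3 * sizeof(double), cudaMemcpyHostToDevice);

    compute_gamma<<<1, 1>>>(d_buf, d_buf + 1, d_buf + 2, d_buf + 3);

    double h_gamma = 0.0;                         // blocking copy waits for the kernel
    cudaMemcpy(&h_gamma, d_buf + 3, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    return h_gamma;
}
```

Inside the loop from the question, the launch would then be something like compute_gamma<<<1,1>>>(&d_gamma_x[i+1], dotresult, d_delta, d_gammaOutput), assuming d_delta and d_gammaOutput are device allocations holding delta and the output. If the kernel is launched on the same CUDA stream that the cublas handle uses, stream ordering should make the cudaDeviceSynchronize call unnecessary, but that depends on how the Magma queue is set up.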

Thanks @talonmies. Your answer was helpful! However, this is considerably slower than the CPU code. I'm a newbie, so sorry for the question, but what do you mean by launching the kernel once? I'm doing this inside the loop; is this wrong? Is this what is decreasing the performance? `mykernel<<<1,1>>>(d_delta, &d_gamma_x[i+1], dotresult, d_gammaOutput);` - eldev09, May 10 '19 at 19:14
@eldev09: you have not misunderstood me, that was poorly worded on my part. I have edited it for clarity - talonmies, May 11 '19 at 13:02
