I am trying to improve the performance of some slow code. The original version used CBLAS, and I wanted to speed it up with MAGMA and CUDA. At first I simply replaced the CBLAS calls with MAGMA ones, but that required CPU <-> GPU copies inside the loop, so the program ran even slower than the CBLAS version. Then, following a suggestion from a Stack Overflow member, I moved part of the computation into a CUDA kernel, which saves one copy per iteration and improved performance a bit. However, my code is still much slower than the CPU version. Is that caused by launching the kernel inside the loop? Is there a way to avoid all the CPU <-> GPU copies inside the loop? I'm starting to think this code may just not be worth parallelizing.

Here is my code:

__global__ void calculateGamma(double *d_delta, double *d_gamma_xi,
                               double *dotresult, double *gamma_output) {
  // Single-scalar kernel: computes gamma = -(gamma_x[i+1] + dot) / delta
  // entirely on the device, so the dot product result never has to be
  // copied back to the host first.
  int index = blockIdx.x;
  gamma_output[index] = -(*d_gamma_xi + *dotresult) / *d_delta;
}

for (i = 0; i < m - 1; i++) {
  if (i == 0) {
    // First iteration: gamma is computed on the host and pushed to the GPU
    gamma = -gamma_x[i+1] / delta;
    cudaMemcpy(d_gammaOutput, &gamma, sizeof(double), cudaMemcpyHostToDevice);
  } else {
    // Dot product on the device
    cublasDdot(h, i, &d_gamma_x[1], 1, &d_l2[1], 1, dotresult);
    cudaDeviceSynchronize();
    cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST);

    // Compute gamma on the GPU, then copy the single scalar back to the host
    calculateGamma<<<1,1>>>(d_delta, &d_gamma_x[i+1], dotresult, d_gammaOutput);
    cudaMemcpy(get_gamma_output, d_gammaOutput, sizeof(double),
               cudaMemcpyDeviceToHost);
    gamma = *get_gamma_output;

    magma_dcopy(i, &d_l2[1], 1, &d_l1[2], 1, queue);
    magma_daxpy(i, gamma, &d_l2[1], -1, &d_l1[2], 1, queue);
    magma_dswap(ny, d_l1, 1, d_l2, 1, queue);
  }
  magma_dcopy(1, d_gammaOutput, 1, &d_l2[1], 1, queue);
  delta = gamma_x[0] + magma_ddot(i+1, &d_gamma_x[1], 1, &d_l2[1], -1, queue);

  ln_determinant_C += log(delta);
}
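
One idea I have been considering to drop that last device-to-host copy, but have not tested (so take the exact cuBLAS calls here as my assumption, not working code): since `magma_daxpy` takes its scalar on the host, I could keep gamma on the device and do the axpy through cuBLAS in device-pointer mode instead, where alpha is read directly from GPU memory:

// Untested sketch: with the handle in device-pointer mode, cublasDdot writes
// its result to device memory and cublasDaxpy reads alpha from device memory,
// so gamma never has to come back to the host inside the loop.
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);
cublasDdot(h, i, &d_gamma_x[1], 1, &d_l2[1], 1, dotresult);
calculateGamma<<<1,1>>>(d_delta, &d_gamma_x[i+1], dotresult, d_gammaOutput);

cublasDcopy(h, i, &d_l2[1], 1, &d_l1[2], 1);                  // was magma_dcopy
cublasDaxpy(h, i, d_gammaOutput, &d_l2[1], -1, &d_l1[2], 1);  // alpha = gamma, device-side,
                                                              // strides mirror the magma_daxpy call
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST);

The delta reduction at the bottom of the loop would still have to come back to the host for the `log()`, so that one scalar copy per iteration seems unavoidable.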
    Hopefully that isn't the actual code you are running, because it is broken. But I'm really puzzled. In the last question you were insistent that the first dot call in the loop was a huge bottleneck, and if you could only do it on the GPU, you would unlock performance. But it turns out, the next thing you do is copy that result to the host. That is a poor idea to say the least – talonmies May 13 '19 at 10:09
  • Hi @talonmies. It's not broken, it works! You're right, by following your suggestion and using a kernel and cublasDdot the program runs faster than before because I reduced one copy, but it's still slower than on the CPU. The problem was not the code, though: I tried the very same code on a more capable GPU and it is much faster than the CPU version! My GPU is a very weak one... That was the reason. I'm going to post an answer explaining that. Thanks for your help! – eldev09 May 13 '19 at 19:22
  • This code is missing a `cublasSetPointerMode` call that was in my answer to your last question, which should, in theory, make this code not work – talonmies May 14 '19 at 05:13
  • Oh yeah, it's there, but before the loop. I didn't write it here. – eldev09 May 14 '19 at 09:19
  • It should be in the loop, before the `cublasDdot` call (see the sketch below). If it works, it only works by accident, because of the way that unified addressing works on your platform. – talonmies May 14 '19 at 11:18
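
For reference, the placement talonmies describes would look like this — a minimal sketch using the names from the question, with the handle `h` and the device buffer `dotresult` assumed to be set up as in the code above:

// Inside the loop, before the dot product: put cuBLAS into device-pointer
// mode so cublasDdot writes its result into device memory (dotresult)
// instead of treating it as a host pointer.
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE);
cublasDdot(h, i, &d_gamma_x[1], 1, &d_l2[1], 1, dotresult);
cudaDeviceSynchronize();
// Back to host-pointer mode for any later calls that pass host scalars.
cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST);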

1 Answer

Update: this code was slower only because of my weak GPU. Running it on a more capable GPU made it much faster than the CBLAS version of the code.

– eldev09