I implemented the same algorithm on the CPU in C++ and on the GPU in CUDA. The algorithm has to solve an integral numerically, since there is no analytic answer to it. The function I have to integrate is a polynomial fit of a curve multiplied by an exponential term at the end.
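For reference, the integrand implemented in the loops below is

f(r) = (a0/r^2 + a1/r + a2 + a3*r + a4*r^2 + a5*r^3) * exp(-a6*r)

accumulated with a fixed-step rectangle rule.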
In C++
for(int l = 0; l < 200; l++)
{
integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*exp(-a6*r_int)*step;
r_int = r_int + step;
}
In CUDA
for(int l = 0; l < 200; l++)
{
integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*__expf(-a6*r_int)*step;
r_int = r_int + step;
}
Output:
CPU: dose_output=0.00165546
GPU: dose_output=0.00142779
I think that the exp function of math.h and the __expf function of CUDA are not calculating the same thing. I tried removing the --use_fast_math compiler flag, thinking that it was the cause, but it seems that both implementations still diverge by around 20%.
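Note that --use_fast_math only rewrites calls to expf into __expf at compile time; a call written explicitly as __expf uses the fast intrinsic regardless of the flag. To see how large the intrinsic's error actually is over this argument range, here is a minimal sketch (a hypothetical test kernel, not part of the original code) that evaluates both functions for the same r values as the loop:

#include <cstdio>

// Compare the standard single-precision expf against the fast __expf
// intrinsic for the r values used in the integration loop.
__global__ void compare_exp(float a6)
{
    float r = 5.0f + 0.0005f * threadIdx.x;  // same r_int values as the loop
    float accurate = expf(-a6 * r);          // standard single-precision exp
    float fast     = __expf(-a6 * r);        // fast, lower-accuracy intrinsic
    printf("r=%f expf=%.9g __expf=%.9g diff=%.3g\n",
           r, accurate, fast, accurate - fast);
}

int main()
{
    compare_exp<<<1, 8>>>(4.7050e-01f);      // a6 from the code below
    cudaDeviceSynchronize();
    return 0;
}

If the printed differences are only a few ulps, the intrinsic alone cannot account for a double-digit percentage gap in the final output.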
I'm using CUDA to accelerate medical physics algorithms, and these kinds of differences are not acceptable: I have to prove that one of the outputs is "more true" than the other, and an error here could obviously be catastrophic for patients.
Does the difference come from the function itself? If not, I'm thinking that it might come from the memcpy of the a_i factors or from the way I fetch them.
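One way to rule the transfer out is to echo the coefficients back from the device and compare them against the host values; a minimal sketch (the kernel name and launch configuration are placeholders, not the original code):

#include <cstdio>

// Print the a0..a6 coefficients as the device sees them, so they can be
// compared digit-for-digit against the host-side values.
__global__ void dump_coeffs(const float* a)
{
    for (int i = 0; i < 7; i++)
        printf("a%d = %.9g\n", i, a[i]);
}

int main()
{
    float h_a[7] = { 5.9991e-04f, -1.4694e-02f, 1.1588f, 4.5675e-01f,
                     -3.8617e-03f, 3.2066e-03f, 4.7050e-01f };
    float* d_a;
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMemcpy(d_a, h_a, sizeof(h_a), cudaMemcpyHostToDevice);
    dump_coeffs<<<1, 1>>>(d_a);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}

If the printed values match the host values exactly, the copy and fetch path can be eliminated as the cause.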
Edit: "Complete" code
#include <math.h>
#include <iostream>
using namespace std;

int main()
{
    // Fit coefficients a0..a6
    float a0 = 5.9991e-04;
    float a1 = -1.4694e-02;
    float a2 = 1.1588;
    float a3 = 4.5675e-01;
    float a4 = -3.8617e-03;
    float a5 = 3.2066e-03;
    float a6 = 4.7050e-01;

    float integral = 0.0;
    float r_int = 5.0;     // lower bound of the integration interval
    float step = 0.1/200;  // 200 steps across an interval of width 0.1

    for(int l = 0; l < 200; l++)
    {
        // rectangle rule: accumulate f(r_int) * step
        integral = integral + (a0*(1/(r_int*r_int)) + a1*(1/r_int) + a2 + a3*r_int + a4*r_int*r_int + a5*r_int*r_int*r_int)*exp(-a6*r_int)*step;
        r_int = r_int + step;
    }

    cout << "Integral=" << integral << endl;
    return 0;
}
I would suggest running this part on both a GPU and a CPU. The coefficient values are from Carleton's seed database.
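As a sanity check for which single-precision result is closer to the truth, the same loop can also be run entirely in double precision on the CPU as a reference baseline (a sketch, same coefficients throughout):

#include <cmath>
#include <iostream>

// Double-precision baseline for the same integral, to compare the two
// single-precision results against.
int main()
{
    double a[7] = { 5.9991e-04, -1.4694e-02, 1.1588, 4.5675e-01,
                    -3.8617e-03, 3.2066e-03, 4.7050e-01 };
    double integral = 0.0;
    double r = 5.0;
    double step = 0.1 / 200;

    for (int l = 0; l < 200; l++)
    {
        integral += (a[0]/(r*r) + a[1]/r + a[2] + a[3]*r
                     + a[4]*r*r + a[5]*r*r*r) * std::exp(-a[6]*r) * step;
        r += step;
    }

    std::cout << "Integral (double) = " << integral << std::endl;
    return 0;
}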