I am lerning how to use OpenMP to make a code use muttiple processors. Recently, I tried to make my Ewald Summation Fourier part parallel using OpenMP. Below is the function named PME_fourier_oprimized which I try to make parallel using OpenMP pragma where,
ParticleN = Number of cahrged particles in the System (an integer number)
kcount = an integer number
qi = prop[i][0] (a double float) which gives charge of a particle i
Ext_L[0] = x dimension of the system (a double float number)
Ext_L[1] = y dimension of the system (a double float number)
Ext_L[1] = z dimension of the system (a double float number)
**r_pos = ParticleNx3 array (r_pos[i][0], r_pos[i][1], and r_pos[i][2] = x, y , and z postions of a particle i)
**PME_mev_ksq = kcountx4 array
**f_kewald = ParticleNx3 array which stores the forces
double PME_Fourier_optimized(int kcount, double **r_pos, double **PME_mvec_ksq, double **prop, double **f_kewald, double *Ext_L){
// Auto alpha determination //
double TRTF = 50*5.5;
double box_VOL = 8*Ext_L[0]*Ext_L[1]*Ext_L[2]; // calculate the volume of the box //
alpha = pow((ParticleN*M_PI*M_PI*M_PI*TRTF)/(box_VOL*box_VOL),1./6.); // Ewald cutoff parameter //
double GAMMA = -0.25/(alpha*alpha);
double recip = 2*M_PI*ONE_PI_EPS0/(8*Ext_L[0]*Ext_L[1]*Ext_L[2]);
double e_kewald=0.0;
double fx_kewald[ParticleN];
double fy_kewald[ParticleN];
double fz_kewald[ParticleN];
#pragma omp parallel for
for (int i=0;i<ParticleN;i++){
fx_kewald[i] = 0.0;
fy_kewald[i] = 0.0;
fz_kewald[i] = 0.0;
for (int j=0;j<dimen;j++){
f_kewald[i][j] = 0.0;
}
}
#pragma omp parallel for reduction(+:e_kewald,fx_kewald[:ParticleN],fy_kewald[:ParticleN],fz_kewald[:ParticleN])
for(int k=0; k<kcount; k++){
double ak_cos=0.0;
double ak_sin=0.0;
double mx = PME_mvec_ksq[k][0];
double my = PME_mvec_ksq[k][1];
double mz = PME_mvec_ksq[k][2];
double ksq = PME_mvec_ksq[k][3];
#pragma omp parallel for reduction(+:ak_sin,ak_cos)
for(int i=0;i<ParticleN;i++){
double qi = prop[i][0]; // charge of particle i //
double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
ak_sin -= qi*sin(kdotr);
ak_cos += qi*cos(kdotr);
}
double a = ak_cos;
double b = ak_sin;
double akak = (a*a + b*b);
double tmp = recip * exp(GAMMA*ksq)/ksq;
#pragma omp parallel for reduction (+:fx_kewald[:ParticleN],fy_kewald[:ParticleN],fz_kewald[:ParticleN])
for(int i=0;i<ParticleN;i++){
double qi = prop[i][0];
double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
double tmp2 = 2*tmp*qi*(sin(kdotr) * a + cos(kdotr) * b);
//Edit3: Following @Laci advice to use 1D array to store the forces and using the reduction to gain a huge acceleration compared to only using omp critical (See the comment section for more details).
fx_kewald[i] += tmp2 * mx;
fy_kewald[i] += tmp2 * my;
fz_kewald[i] += tmp2 * mz;
}
e_kewald += tmp * akak; // add the energies for each k //
}
//Edit3: Finally storing the calculated values in the 2D array //
#pragma omp parallel for
for(int i=0;i<ParticleN;i++){
f_kewald[i][0] = fx_kewald[i];
f_kewald[i][1] = fy_kewald[i];
f_kewald[i][2] = fz_kewald[i];
}
//printf("PME KEwald: %lf\n", e_kewald);
return e_kewald;
}
I compared this code output with my serial code version and notied there is an error in the e_kewald energy and f_kewald force after 2 decimal places. Interestingly, I get differnt results each time I run it (maybe it is the case of data race).
When I remove the first pragma command ( #pragma omp parallel for private(k, mx, my, mz, ksq, qi, kdotr, tmp2) reduction(+:e_kewald) ) before the outer k loop, it starts to work perfectly and matches perfectly with the serial code result. Is there something that I am doing grossly wrong? I am also not sure if the way I did use pargma is correct or not. Any help is highly appreciated.
Edit1: After using the #pragma omp critical before the f_kewald array solves the problem. But, not sure if it's the best way to do it.
My 2d arrays are defined using the following function:
double **alloc_2d_double(int rows, int cols){
int i=0;
double *data = (double *)malloc(rows*cols*sizeof(double));
double **array= (double **)malloc(rows*sizeof(double *));
for (i=0; i<rows; i++)
array[i] = &(data[cols*i]);
return array;
}
I am still wondering if the f_kewald[i][0], f_kewald[i][1], f_kewald[i][2] can be use under the reduction.
Edit2: Declaring the variables inside the loop and putting the braces in the omp critical section.
ParticleN ~ 5000-10000 kcount ~ 2000
Edit3: Using 1D arrays (fx_kewald, fy_kewald, fz_kewald) with reduction to gain enhanced parallelization speed instead of a opm critical section with 2D arrays. (Thanks to the @Laci's comment below)