
I am learning how to use OpenMP to make my code use multiple processors. Recently, I tried to parallelize the Fourier part of my Ewald summation with OpenMP. Below is the function PME_Fourier_optimized that I am trying to parallelize with OpenMP pragmas, where:

ParticleN = number of charged particles in the system (an integer)
kcount = number of reciprocal-space (k) vectors (an integer)
qi = prop[i][0] (a double) = charge of particle i
Ext_L[0] = x dimension of the system (a double)
Ext_L[1] = y dimension of the system (a double)
Ext_L[2] = z dimension of the system (a double)
**r_pos = ParticleN x 3 array (r_pos[i][0], r_pos[i][1], and r_pos[i][2] = x, y, and z positions of particle i)
**PME_mvec_ksq = kcount x 4 array (k-vector components and k² for each k)
**f_kewald = ParticleN x 3 array which stores the forces
 double PME_Fourier_optimized(int kcount, double **r_pos, double **PME_mvec_ksq, double **prop, double **f_kewald, double *Ext_L){
    
    // Auto alpha determination //
    double TRTF = 50*5.5;
    double box_VOL = 8*Ext_L[0]*Ext_L[1]*Ext_L[2]; // calculate the volume of the box //
    alpha = pow((ParticleN*M_PI*M_PI*M_PI*TRTF)/(box_VOL*box_VOL),1./6.); // Ewald cutoff parameter //

    double GAMMA = -0.25/(alpha*alpha);
    double recip = 2*M_PI*ONE_PI_EPS0/(8*Ext_L[0]*Ext_L[1]*Ext_L[2]);
    double e_kewald=0.0;
    
    double fx_kewald[ParticleN];
    double fy_kewald[ParticleN];
    double fz_kewald[ParticleN];

    #pragma omp parallel for
    for (int i=0;i<ParticleN;i++){
        fx_kewald[i] = 0.0;
        fy_kewald[i] = 0.0;
        fz_kewald[i] = 0.0;
        for (int j=0;j<dimen;j++){
            f_kewald[i][j] = 0.0; 
        }
    }
    
    #pragma omp parallel for reduction(+:e_kewald,fx_kewald[:ParticleN],fy_kewald[:ParticleN],fz_kewald[:ParticleN])
    for(int k=0; k<kcount; k++){
        
        double ak_cos=0.0;
        double ak_sin=0.0;
        
        double mx = PME_mvec_ksq[k][0]; 
        double my = PME_mvec_ksq[k][1];
        double mz = PME_mvec_ksq[k][2];
        double ksq = PME_mvec_ksq[k][3];

        #pragma omp parallel for reduction(+:ak_sin,ak_cos)
        for(int i=0;i<ParticleN;i++){
            double qi = prop[i][0]; // charge of particle i //
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];

            ak_sin -= qi*sin(kdotr);
            ak_cos += qi*cos(kdotr);
        }
        
        double a = ak_cos;
        double b = ak_sin;

        double akak = (a*a + b*b);
        double tmp = recip * exp(GAMMA*ksq)/ksq;
        
        #pragma omp parallel for reduction (+:fx_kewald[:ParticleN],fy_kewald[:ParticleN],fz_kewald[:ParticleN])
        for(int i=0;i<ParticleN;i++){
            double qi = prop[i][0];
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            double tmp2 = 2*tmp*qi*(sin(kdotr) * a + cos(kdotr) * b);
            
            //Edit3: Following @Laci advice to use 1D array to store the forces and using the reduction to gain a huge acceleration compared to only using omp critical (See the comment section for more details).

            fx_kewald[i] += tmp2 * mx; 
            fy_kewald[i] += tmp2 * my; 
            fz_kewald[i] += tmp2 * mz;
        }
        
        e_kewald += tmp * akak; // add the energies for each k //
    }
    
    //Edit3: Finally storing the calculated values in the 2D array //
    #pragma omp parallel for
    for(int i=0;i<ParticleN;i++){
        f_kewald[i][0] = fx_kewald[i];
        f_kewald[i][1] = fy_kewald[i];
        f_kewald[i][2] = fz_kewald[i];
    }

    //printf("PME KEwald: %lf\n", e_kewald);
    return e_kewald; 
}

I compared the output of this code with my serial version and noticed that the e_kewald energy and the f_kewald forces differ after the second decimal place. Interestingly, I get different results each time I run it (probably a data race).

When I remove the first pragma ( #pragma omp parallel for private(k, mx, my, mz, ksq, qi, kdotr, tmp2) reduction(+:e_kewald) ) before the outer k loop, it works and matches the serial result exactly. Is there something I am doing grossly wrong? I am also not sure whether the way I used the pragmas is correct. Any help is highly appreciated.

Edit1: Putting a #pragma omp critical before the f_kewald update solves the problem, but I am not sure it is the best way to do it.

My 2D arrays are allocated with the following function:

double **alloc_2d_double(int rows, int cols){
    int i=0;
    double *data = (double *)malloc(rows*cols*sizeof(double));
    double **array= (double **)malloc(rows*sizeof(double *));
    for (i=0; i<rows; i++)
        array[i] = &(data[cols*i]);

    return array;
} 

I am still wondering whether f_kewald[i][0], f_kewald[i][1], f_kewald[i][2] can be used under a reduction.

Edit2: Declared the variables inside the loops and put braces around the statements in the omp critical section.
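Roughly, the protected force update from Edit1/Edit2 looked like this (a sketch of that earlier version, not the exact code):

    // inside the inner particle loop over i, within the parallel k loop
    #pragma omp critical
    {
        f_kewald[i][0] += tmp2 * mx;
        f_kewald[i][1] += tmp2 * my;
        f_kewald[i][2] += tmp2 * mz;
    }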

ParticleN ~ 5000-10000, kcount ~ 2000

Edit3: Used three 1D arrays (fx_kewald, fy_kewald, fz_kewald) with a reduction instead of an omp critical section on the 2D array, which gives a much better speed-up (thanks to @Laci's comments below).
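For completeness, the alternative suggested in the comments, reducing directly over the contiguous block from alloc_2d_double treated as one flat array (this needs OpenMP 4.5+ array-section reductions), would look roughly like this, reusing the variables of the function above; I went with the three separate 1D arrays instead:

    double *f_flat = &f_kewald[0][0]; // alloc_2d_double stores all rows in one block

    // Each thread gets a private zero-initialized copy of the whole section
    // (ParticleN*3 doubles per thread), so memory use grows with the thread count.
    #pragma omp parallel for reduction(+:e_kewald, f_flat[:ParticleN*3])
    for (int k = 0; k < kcount; k++) {
        double mx = PME_mvec_ksq[k][0], my = PME_mvec_ksq[k][1];
        double mz = PME_mvec_ksq[k][2], ksq = PME_mvec_ksq[k][3];
        double ak_cos = 0.0, ak_sin = 0.0;

        for (int i = 0; i < ParticleN; i++) {
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            ak_sin -= prop[i][0]*sin(kdotr);
            ak_cos += prop[i][0]*cos(kdotr);
        }

        double tmp = recip * exp(GAMMA*ksq)/ksq;

        for (int i = 0; i < ParticleN; i++) {
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            double tmp2 = 2*tmp*prop[i][0]*(sin(kdotr)*ak_cos + cos(kdotr)*ak_sin);
            // write through the reduced alias, not through f_kewald[i][...]
            f_flat[3*i + 0] += tmp2 * mx;
            f_flat[3*i + 1] += tmp2 * my;
            f_flat[3*i + 2] += tmp2 * mz;
        }

        e_kewald += tmp * (ak_cos*ak_cos + ak_sin*ak_sin);
    }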

Seth
  • Shouldn't akak and tmp be private? – Simon Goater Feb 23 '23 at 00:21
  • I put akak and tmp in the private but still no luck. – Seth Feb 23 '23 at 02:27
  • Make all your loop variables declared in the loop header. That literally solves half of all questions posted about OpenMP. – Victor Eijkhout Feb 23 '23 at 04:39
  • All that `**` stuff is not a good way to program. You're not giving the code so I can't see if you create the arrays correctly, but let's assume so. For memory locality it's a better idea to allocate 2D arrays as a contiguous block. – Victor Eijkhout Feb 23 '23 at 04:42
  • Just a silly question of mine: you're doing nested parallelism here. Is that intentional? How are you managing it? – Gilles Feb 23 '23 at 07:44
  • @VictorEijkhout I have edited my question to include how I made the 2d arrays, but not sure if the memory is allocated as a contiguous block or not. – Seth Feb 23 '23 at 16:49
  • @Gilles I tried the nested parallelism to gain speed up, but after putting #pragma omp critical, not sure whether the nested things are at all needed. – Seth Feb 23 '23 at 16:52
  • @Seth Your arrays are indeed allocated as pointers into one contiguous block. Good. You should still declare the loop variables in the loop headers, rather than using `private`. – Victor Eijkhout Feb 23 '23 at 17:01
  • Note that `#pragma omp critical` protects the next line only (without braces), you still have data race on the next 2 lines. It is much better to use reduction (if possible) or use atomic operations. How big is `ParticleN`? – Laci Feb 23 '23 at 21:08
  • @Laci Thanks for pointing out the braces problem. I tried to use #pragma omp parallel for reduction(+:f_kewald[:ParticleN][:3]) but it didn't work, saying that "the array section is not contiguous". – Seth Feb 23 '23 at 23:25
  • @Victor Thanks for mentioning the proper way of declaring the variables, I have now modified the code. – Seth Feb 23 '23 at 23:27
  • if `f_kewald` is allocated by `alloc_2d_double`, then it is indeed contiguous in memory, but you have to access it as a 1D array. Note that your stack has to be big enough to store a local copy for all the threads. From a performance point of view it would be much better to use one-dimensional arrays. – Laci Feb 24 '23 at 06:48
  • So, my suggestion is to allocate a `ParticleNx3`-sized 1D array for `f_kewald`. In this case, reduction can be used, and, more importantly, the compiler can properly vectorize your code. Another suggestion is to use `const` variables wherever possible. It may help the compiler to better optimize your code, especially in the case of parallel code using shared variables. – Laci Feb 24 '23 at 08:04
  • @Laci Now I have used three 1D arrays and a reduction over those, which gives me a better speed-up than using the omp critical section. I have modified the code accordingly. Thank you very much for your suggestion! – Seth Feb 25 '23 at 00:45
  • After the edits, the question contains an answer to itself. Probably not useful to anyone reading it in future. – anatolyg Feb 26 '23 at 14:35

1 Answer


A simplified view of your code:

    #pragma omp parallel for ...
    for(k=0; k<kcount; k++){
            ...
            f_kewald[i][0] += ...
            ...
    }

It's clear that different threads will update f_kewald[i][0] simultaneously, leading to errors. This happens for all i.

When you stuff all your code updating f_kewald into a critical section, this helps — now threads wait for each other. But still the updates are in arbitrary order, and the critical section doesn't discriminate between different values of i.

I suggest you parallelize your calculations only at one level - decide whether you want to do it at the outer loop, or at the inner loops, not both. Assuming your ParticleN is big enough, you don't need to parallelize your outer loop. This is the easy case.
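For the easy case, here is a sketch (it reuses the variables and arrays from the function in your question, so treat it as a fragment of that function, not standalone code): drop the pragma on the k loop and keep only the pragmas on the particle loops. Each i is then written by exactly one thread per k, so f_kewald needs no critical section or array reduction:

    for (int k = 0; k < kcount; k++) {      // serial over k
        double mx = PME_mvec_ksq[k][0], my = PME_mvec_ksq[k][1];
        double mz = PME_mvec_ksq[k][2], ksq = PME_mvec_ksq[k][3];
        double ak_cos = 0.0, ak_sin = 0.0;

        #pragma omp parallel for reduction(+:ak_sin, ak_cos)
        for (int i = 0; i < ParticleN; i++) {
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            ak_sin -= prop[i][0]*sin(kdotr);
            ak_cos += prop[i][0]*cos(kdotr);
        }

        double tmp = recip * exp(GAMMA*ksq)/ksq;

        #pragma omp parallel for
        for (int i = 0; i < ParticleN; i++) {
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            double tmp2 = 2*tmp*prop[i][0]*(sin(kdotr)*ak_cos + cos(kdotr)*ak_sin);
            f_kewald[i][0] += tmp2 * mx;    // row i belongs to exactly one thread
            f_kewald[i][1] += tmp2 * my;
            f_kewald[i][2] += tmp2 * mz;
        }

        e_kewald += tmp * (ak_cos*ak_cos + ak_sin*ak_sin);  // k loop is serial: no race
    }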


If your ParticleN can be small, it is better to parallelize the outer loop, but then you have to fix the synchronization problem of f_kewald somehow. Some ideas:

  • Switch your loops inside-out
  • Break your outer loop into several pieces, each of which contains only one inner loop

This is a bad situation — you have to tweak your code, and it may be hard or impossible. If you can avoid it at all (see above), please do!
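A rough sketch that combines both ideas (split the outer loop into two passes, and swap the loops in the force pass). Here ak_cos_k, ak_sin_k and tmp_k are scratch arrays of length kcount that you would have to add; the other names are taken from your question:

    /* Pass 1: parallel over k; each iteration writes only its own slot k */
    #pragma omp parallel for reduction(+:e_kewald)
    for (int k = 0; k < kcount; k++) {
        double mx = PME_mvec_ksq[k][0], my = PME_mvec_ksq[k][1];
        double mz = PME_mvec_ksq[k][2], ksq = PME_mvec_ksq[k][3];
        double c = 0.0, s = 0.0;

        for (int i = 0; i < ParticleN; i++) {
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            s -= prop[i][0]*sin(kdotr);
            c += prop[i][0]*cos(kdotr);
        }

        ak_cos_k[k] = c;                     /* scratch arrays, one slot per k */
        ak_sin_k[k] = s;
        tmp_k[k]    = recip * exp(GAMMA*ksq)/ksq;
        e_kewald   += tmp_k[k] * (c*c + s*s);
    }

    /* Pass 2: loops swapped, parallel over i; each thread owns whole rows of f_kewald */
    #pragma omp parallel for
    for (int i = 0; i < ParticleN; i++) {
        double qi = prop[i][0];
        for (int k = 0; k < kcount; k++) {
            double mx = PME_mvec_ksq[k][0], my = PME_mvec_ksq[k][1];
            double mz = PME_mvec_ksq[k][2];
            double kdotr = mx*r_pos[i][0] + my*r_pos[i][1] + mz*r_pos[i][2];
            double tmp2 = 2*tmp_k[k]*qi*(sin(kdotr)*ak_cos_k[k] + cos(kdotr)*ak_sin_k[k]);
            f_kewald[i][0] += tmp2 * mx;
            f_kewald[i][1] += tmp2 * my;
            f_kewald[i][2] += tmp2 * mz;
        }
    }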


In any case, several people advised against using private to control which variables are thread-private. This is good advice! Declare your variables in the proper C scope instead of listing them in private. This makes it easier to transform your code later (if you ever need to), and prevents unpleasant surprises (when you forget just one variable in your giant private list).

anatolyg
  • In my case ParticleN ~ 5000 and kcount ~2000, in which case I assume the inner loop parallelization will work fine. I have now removed the variables from private and declared them inside. However, is there a way to not use the omp critical and use reduction in f_kewald[i][0]? – Seth Feb 23 '23 at 23:34