    pos = calloc(nbodies, sizeof(*pos));
    forces = calloc(nbodies, sizeof(*forces));
    //...more...
    printf("Calculating......\n");
    ene = 0.0;

    #pragma omp parallel shared(pos,forces,ene,i)
    {
        #pragma omp for private(j,k,d,d2,d3,rij)
        for(i=0; i<nbodies; ++i){
            for(j=i+1; j<nbodies; ++j) {
                d2 = 0.0;
                for(k=0; k<3; ++k) {
                    rij[k] = pos[i][k] - pos[j][k];
                    d2 += rij[k]*rij[k];
                }
                if (d2 <= cut2) {
                   d = sqrt(d2);
                   d3 = d*d2;
                   for(k=0; k<3; ++k) {
                        double f = -rij[k]/d3;
                        forces[i][k] += f;
                        #pragma omp atomic
                        forces[j][k] -= f;
                   }
                   #pragma omp atomic
                   ene += -1.0/d; 
               }
            }
        }
    }

I'm running the parallel code with 2 threads, built with Dev-C++ and OpenMP. My parallel OpenMP C code runs at the same speed as, or much slower than, the serial version. Is there any solution?

  • You can use the reduction clause for the `ene` variable, and for the arrays you can use one array per thread to avoid the synchronization cost of `#pragma omp atomic`. Then, outside of the parallel region, reduce the per-thread forces into a single array. – dreamcrash Dec 07 '21 at 16:18
  • False sharing may hurt as well, so it may be faster to work on a *local copy* of the `forces` array and then perform a reduction. – Jérôme Richard Dec 07 '21 at 18:07
  • In other words, in place of atomic you should use a reduction for both `ene` and `forces`. There is no need to create a local array manually, since that is exactly what a reduction would do anyway. – Qubit Dec 08 '21 at 08:23
  • @Qubit Yep, exactly something similar to https://github.com/dreamcrash/ScholarShipCode/blob/a1feec2f90b4a05238417038a2a78165d22eb07c/ThesisCaseStudies/C/MD/SM/DataRedundancyApproach/OpenMPReductions/ParticlesSoA.c#L121 – dreamcrash Dec 08 '21 at 08:34

1 Answer

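Introducing synchronization always has an overhead. But you only need it here because you're trying to save a few operations: the loop computes each i,j pair once and applies the result to both `forces[i]` and `forces[j]`, so different threads end up writing to the same entries. Ask yourself: is a factor of 2 in work savings important when you have tens of cores to run the work in parallel?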

So maybe you should make the code a little more wasteful in scalar terms, meaning it computes the forces for all i,j, but much easier to parallelize: iteration i then writes only to `forces[i]`, and the atomics disappear.

Victor Eijkhout