2

I am trying to parallelize the following C++ code using OpenMP:

int np = 1000000;
double kk = 1 / pow(2 * pi, 2);
for (int kes = 1; kes <= 100; kes++) {
  double E1 = 0;
  #pragma omp parallel for reduction(+: E1)
  for (int ies = 0; ies < np; ies++) {
    for (int jes = 0; jes < np; jes++) {
      if (ies != jes) {
        float distanes = sqrt(pow(xp[ies] - xp[jes], 2) + pow(yp[ies] - yp[jes], 2) + pow(zp[ies] - zp[jes], 2));
        float distan = kes * distanes;
        if (distan <= 5) {
          float gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];
          E1 = E1 + kk * gpspec * sin(kes * distanes) / (kes * distanes);
        }
      }
    }
  }
  Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1;
}

This code is parallelized. However, the calculation time is still terrible. How can I speed up this calculation with its n^2 operations? xp, yp, zp, gpx, gpy, gpz are 1-dimensional vectors.

Jérôme Richard
Luc Nguyen
  • 2
    For starters, instead of calling `pow(x, 2)` you could just do `x * x`. And instead of taking the square root of `distanes` _before_ doing the less-than-five test, you could use the squared value and test it's less than 25. Then actually use your pre-calculated value for `kes * distanes` instead of duplicating that code and stressing the compiler out. Use sin-tables instead of invoking `sin` directly (if an approximation is sufficient). – paddy Oct 13 '20 at 07:19
  • 2
    Depending upon how distributed your data is, `if(abs(xp[ies] -xp[jes])<5) continue; // similarly for other two dimensions` may be helpful. – Tanveer Badar Oct 13 '20 at 07:20
  • You might also gain something from pre-generating all the square-distances into an array, since that's actually a vectorizable operation. – paddy Oct 13 '20 at 07:24
  • @paddy thank you very much. Size of 1D vector xp, yp, zp is 10^6. If I store square-distances using a 2D array, its size will be 10^12. This array size is too large for memory. – Luc Nguyen Oct 13 '20 at 07:28
  • 3
    Well, you know you're always calculating these distances twice, right? Why not remove the `ies != jes` test in the loop, and start the `jes` loop from `ies + 1`? Then you can simply multiply the final value of `E1` by 2 after finishing the loops. That alone cuts your execution time in half. And like @TanveerBadar suggested, _any_ way to short-circuit unnecessary multiplications would be helpful, if most of your data will not pass the distance test, that is. – paddy Oct 13 '20 at 07:33
  • 1
    I would personally move the parallel pragma to the `kes` loop, too. Especially with the above optimization. Unless you're running on a machine with hundreds of cores, I suppose. If your compiler can't vectorize the mathematics inside the inner loop, you might need to roll your own custom SIMD solution, which can potentially give you an additional massive performance boost, especially with modern AVX. – paddy Oct 13 '20 at 07:37
  • 1
    And reading the excellent suggestions by @paddy, I have one more to add to the mix. Have you considered GPGPU instead of OpenMP? – Tanveer Badar Oct 13 '20 at 07:42
  • @paddy and Tanveer Badar, thank you very much for your nice suggestions above. Parallelizing the "kes" loop needs hundreds of cores. Unfortunately, my PC has only 40 cores. SIMD and GPGPU are also solutions. I will search to understand them further. – Luc Nguyen Oct 13 '20 at 07:53
  • 4
    It might be a bit tricky to organise, but you are computing distanes, which doesn't depend on kes, 100 times, once per value of kes. If you could reorganise the code so that the kes loop was the inner one, that could be a big gain. On a practical matter, if you could post something that readers could compile, you might get better answers – dmuir Oct 13 '20 at 07:57
  • What is the probability for `distan <= 5` to be true in practice? Is it mostly true or mostly false or unpredictable? – Jérôme Richard Oct 13 '20 at 08:40
  • @JérômeRichard when kes is small, it is mostly true. However, when kes is higher, it is mostly false. xp, yp, zp range from 0 to pi/2. – Luc Nguyen Oct 13 '20 at 08:46

2 Answers

4

There is no real answer to this question, but I'd like to distill some of the more important optimizations discussed in the comments. Let's focus on just the inner loops.

Primarily, you need to avoid excessive multiplications and function calls. And there are some tricks that aren't guaranteed to be optimized by compilers. For example, we know intuitively that pow(x, 2) just squares a value, but if your compiler doesn't optimize this, then it's much less efficient than simply x * x.

Further, it was identified that the O(N²) loop can actually do half the work (N²/2 iterations) because distances are symmetric. This is a big deal if you're calling expensive things like pow and sqrt. You can just scale the final result of E1 by 2 to compensate for halving the number of calculations.

And on the subject of sqrt, it was also identified that you don't need to do that before your distance test. Do it after, because the test sqrt(d) < 5 is the same as d < 25.

Let's go even further, beyond the comments. Notice that the < 5 test actually relies on a multiplication involving kes. If you precomputed a distance-squared value that also incorporates the kes scaling, then you have even fewer multiplications.

You can also remove the kk value from the E1 calculation. That doesn't need to happen in a loop... probably. By that, I mean you're likely to have floating point error in all these calculations. So every time you change something, your final result might be slightly different. I'm gonna do it anyway.

So... After that introduction, let's go!

for (kes = 1; kes <= 100; kes++)
{
  double E1 = 0;
  const float distance_sq_thresh = 25.0f / kes / kes;

#pragma omp parallel for reduction(+: E1)
  for (ies = 0; ies < np; ies++)
  {
    for (jes = ies+1; jes < np; jes++)
    {
      float dx = xp[ies] - xp[jes];
      float dy = yp[ies] - yp[jes];
      float dz = zp[ies] - zp[jes];

#if 0
      // From Tanveer Badar's suggestion, if distances are generally large.
      // This may be detrimental for large values of kes.
      // (compare squared deltas so the units match the squared threshold)
      if (dx * dx > distance_sq_thresh ||
          dy * dy > distance_sq_thresh ||
          dz * dz > distance_sq_thresh)
      {
        continue;
      }
#endif

      float distance_sq = dx * dx + dy * dy + dz * dz;
      if (distance_sq <= distance_sq_thresh)
      {
        float distan = kes * sqrt(distance_sq);
        float gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];
        E1 = E1 + gpspec * sin(distan) / distan;
      }
    }
  }

  Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1 * kk * 2.0f;
}

Now, this I guarantee will be faster than your existing code by a significant amount.

If you want to go further, in each kes loop, you could pre-generate a large table of values for sin(distan) / distan for the possible range, and just index into it when you need it. Generally, trig operations and divisions are slow. So if you can work out your acceptable error tolerance and create a sufficiently large pre-calculated table, this might be a good optimization too.
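
For illustration, a rough sketch of what such a table could look like (the table size here is an arbitrary assumption, and you would need to verify the accuracy against your own tolerance). Because of the distance test, distan never exceeds 5, so the table only has to cover that range:

const int TABLE_SIZE = 1 << 16;     // assumed resolution; tune to your error tolerance
const float MAX_DISTAN = 5.0f;      // distan = kes * distance is capped at 5 by the test
std::vector<float> sinc_table(TABLE_SIZE);
for (int i = 0; i < TABLE_SIZE; i++)
{
    float x = (i + 0.5f) * MAX_DISTAN / TABLE_SIZE;   // sample mid-bin so x is never 0
    sinc_table[i] = sin(x) / x;
}

// Then, in the inner loop, replace sin(distan) / distan with a lookup:
//   int idx = (int)(distan * (TABLE_SIZE / MAX_DISTAN));
//   if (idx >= TABLE_SIZE) idx = TABLE_SIZE - 1;      // guard the distan == 5 edge case
//   E1 = E1 + gpspec * sinc_table[idx];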


Update

You have posted an answer taking user dmuir's suggestion of running the kes loop as the inner loop to avoid repeating expensive calculations. However, in the process you also abandoned some of the principles that I laid out in my answer. I commented there, but let me just put them down in code for you.

First, pre-calculate the squared-distance thresholds:

const double max_distance = 5.0;
double distance_sq_thresh[101] = {0};
for (kes = 1; kes <= 100; kes++)
{
    distance_sq_thresh[kes] = max_distance * max_distance / kes / kes;
}

Now, the main section:

const int np = 1000000;
double E1[101] = {0};   // per-kes accumulators; the scalar E1 from before becomes an array here

for (ies = 0; ies < np; ies++){
    for (jes = ies+1; jes < np; jes++){
        double dxp = xp[ies] - xp[jes];
        double dyp = yp[ies] - yp[jes];
        double dzp = zp[ies] - zp[jes];
        double distance_sq = dxp * dxp + dyp * dyp + dzp * dzp;

        // If the first kes iteration won't pass the test, none will
        if (distance_sq > distance_sq_thresh[1])
            continue;

        const double distance = sqrt(distance_sq);
        const double gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];

        // We can keep adding distance to 'distan' instead of multiplying in a loop
        double distan = 0.0;

        for (kes = 1; kes <= 100; kes++){
            // We know that the threshold decreases as kes increases, so break early
            if (distance_sq > distance_sq_thresh[kes])
                break;

            // Increment first so that distan == kes * distance (avoids sin(0)/0 on the first pass)
            distan += distance;
            E1[kes] = E1[kes] + gpspec * sin(distan) / distan;
        }
    }
}

And finally applying the results. Since there are only 100 of them, there's not really any point in making this parallel.

const double kk = 1.0 / pow(2.0 * pi, 2.0);
for (kes = 1; kes <= 100; kes++){
    Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1[kes] * kk * 2.0f;
}

You can see that with this approach, you're no longer able to use a simple scalar reduction on E1 when parallelizing the outer ies loop, because multiple threads might be writing to the same E1 element simultaneously. Maybe OpenMP has a directive allowing some kind of array reduction. I don't know. Perhaps you can go looking.
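
(It turns out OpenMP 4.5 and newer do support reductions over array sections, so if your compiler is recent enough, something along these lines may be all you need. This sketch assumes E1 is a plain 101-element array; check your compiler's OpenMP support before relying on it.)

double E1[101] = {0};

// reduction over an array section requires OpenMP 4.5 or newer
#pragma omp parallel for reduction(+: E1[:101]) schedule(dynamic)
for (int ies = 0; ies < np; ies++)
{
    for (int jes = ies + 1; jes < np; jes++)
    {
        // ... same distance / threshold / gpspec / kes work as above,
        //     accumulating into E1[kes] ...
    }
}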

Alternatively, you could allocate one of these arrays for each OpenMP thread and use the thread's ordinal to select the array. Then add the arrays together at the end. Essentially rolling your own array reduction.
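
A rough sketch of that idea (the names are only for illustration; omp_get_max_threads and omp_get_thread_num are standard OpenMP runtime calls):

#include <omp.h>

std::vector<std::vector<double>> E1_per_thread(omp_get_max_threads(),
                                               std::vector<double>(101, 0.0));

#pragma omp parallel for schedule(dynamic)
for (int ies = 0; ies < np; ies++)
{
    std::vector<double>& my_E1 = E1_per_thread[omp_get_thread_num()];
    for (int jes = ies + 1; jes < np; jes++)
    {
        // ... same work as above, accumulating into my_E1[kes] ...
    }
}

// Add the per-thread arrays together at the end
for (const auto& t : E1_per_thread)
    for (int kes = 1; kes <= 100; kes++)
        E1[kes] += t[kes];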

Or you could even go old-school, by defining a local array for E inside the outer-most loop, knowing that it is local to the current OpenMP worker. Then, after the jes loop, you can acquire a lock on the shared E1 array and update it from these local values. Since the jes loop is doing 1 million expensive calculations, and the lock-and-update is only doing 100 additions, I expect that most of the time no threads will be blocked on the lock.
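
In code, that might look roughly like this (E_local is just an illustrative name, and the lock is expressed with an OpenMP critical section):

#pragma omp parallel for schedule(dynamic)
for (int ies = 0; ies < np; ies++)
{
    double E_local[101] = {0};   // private to this iteration, so no sharing issues

    for (int jes = ies + 1; jes < np; jes++)
    {
        // ... same work as above, accumulating into E_local[kes] ...
    }

    // Cheap compared to the jes loop: merge 100 sums under a lock
    #pragma omp critical
    for (int kes = 1; kes <= 100; kes++)
        E1[kes] += E_local[kes];
}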

paddy
  • 1
    I updated the answer to include information about the decision to cut the amount of looping in half and scale `E1` by 2. While doing that, I realized we can also remove the `kk` multiplication, since that's common to all terms. That can happen after the loops. – paddy Oct 13 '20 at 08:22
  • This is a very nice suggestion. It speeds up the calculation by a factor of two. I will test the code with your suggestion. – Luc Nguyen Oct 13 '20 at 08:31
  • And the suggestion of @dmuir also seems to be a nice one. It could be extremely fruitful. However, I do not know how to reorganize the loops. – Luc Nguyen Oct 13 '20 at 08:31
0

Based on the suggestions by @paddy and @dmuir (make the kes loop the inner one and start jes at ies+1), I modified the code as below:

double kk = 1 / pow(2 * pi, 2);
int np = 1000000;
//#pragma omp parallel for reduction(+: E1)
for (ies = 0; ies < np; ies++){
    for (jes = ies+1; jes < np; jes++){


        double dxp = xp[ies] - xp[jes];
        double dyp = yp[ies] - yp[jes];
        double dzp = zp[ies] - zp[jes];
        double distance = sqrt( dxp * dxp + dyp * dyp + dzp * dzp );
        double gpspec = gpx[ies] * gpx[jes] + gpy[ies] * gpy[jes] + gpz[ies] * gpz[jes];

        for (kes = 1; kes <= 100; kes++){
        
            double distance_thresh = 5.0/kes;
            if (distance <= distance_thresh){

                double distan = kes * distance;
                E1[kes] = E1[kes] + gpspec * sin(distan) / distan;
            }
        }
    }
}

#pragma omp parallel for
for (kes = 1; kes <= par_spec; kes++){

    Ees[kes] = exp(-kes * pow(sp, 2) / 2) * E1[kes] * kk * 2.0f;
}

It seems to be better. However, how can it be parallelized?

Luc Nguyen
  • This should be a separate question, not an answer. Nobody will see this, and your question already has an accepted answer. But I will comment anyway. Why did you remove the parallel pragma on the most expensive loops? And you've added back the square root calculation for all values, which may be unnecessary because I expect some pairs will not match _any_ distance. Pay close attention to how I computed and used the threshold in my answer. – paddy Oct 13 '20 at 21:38
  • You should also see that if you start `kes` at 100 instead of 1, you can step backwards until your distance check fails and then break out of the kes loop. In fact, you only need to calculate the square root and execute the kes loop if `distance_sq < 25.f * 100.f`. Because if that doesn't pass, none of the smaller thresholds will either. You might also get a performance boost by pre-calculating `distance_sq_thresh` (from my answer) as an array, as this saves potentially trillions of multiplications. – paddy Oct 13 '20 at 21:43
  • And, much like you calculating `sqrt` too early here, you're also calculating `gpspec` too soon. Use the trick of short-circuiting the entire `kes` loop that I suggested above. You absolutely don't need to do these calculations if none of the `kes` loop tests are going to result in modifying `E1`. – paddy Oct 13 '20 at 21:45
  • Oh, I noticed that my math was incorrect for the threshold. I updated my answer. So, my comment about running the `kes` loop backwards is incorrect. It should run forwards, and break out as soon as the test fails. And if `distance_sq > distance_sq_thresh`, you should `continue` to the next `jes` loop iteration without ever calculating the `sqrt`, calculating `gpspec`, or running the `kes` loop. – paddy Oct 13 '20 at 21:52
  • For each kes, many sqrt calculations are repeated. It would be better if I reused the sqrt values from the previous kes. Your way cuts the calculation in half, but it is still expensive; the cost can be reduced further. – Luc Nguyen Oct 13 '20 at 21:55
  • See, this is why it's so ineffective for you to post a new question as an answer to your old question. You don't understand what I'm saying at all. It's not appropriate for me to explain this in a new answer, and nobody else is watching this discussion to contribute and put you on the right track. – paddy Oct 13 '20 at 21:57
  • 1
    I suggest you read all my comments carefully, over and over, until you get it. – paddy Oct 13 '20 at 21:59
  • I have updated my answer to incorporate these comments. – paddy Oct 13 '20 at 22:23
  • Oh, thank you very much. I think the updated main section is very nice. Could you teach me how to parallelize this section using OpenMP? @paddy – Luc Nguyen Oct 13 '20 at 22:33
  • The distance_sq_thresh[kes] = max_distance * max_distance / kes/kes; – Luc Nguyen Oct 13 '20 at 22:35
  • Do I need to post a new question with your answer to ask about parallelization with OpenMP? Sorry for this question; I do not know the forum conventions well. However, I know that this forum is very, very helpful. – Luc Nguyen Oct 13 '20 at 22:40
  • At this stage, your question is probably better suited for codereview.stackexchange.com. But of course, you could post it here if you want to ask specifically how to get the `Ees` array to function like a reduction in OpenMP, and yes post whatever your latest code is. It would pay to reference this question too, which provides a lot of context due to the extended discussions. – paddy Oct 13 '20 at 22:51
  • Why did you write `distan += distance`? I do not understand; why not `distan = kes * distance`? – Luc Nguyen Oct 13 '20 at 22:54
  • Addition on some architectures is faster than multiplication. On very old CPUs, it's massively faster. On very new CPUs, it's about the same. Performing multiple additions will suffer from rounding error. If you don't like that small optimization, don't use it. – paddy Oct 13 '20 at 23:00
  • Thank you very much for your comments. I really appreciate your support. – Luc Nguyen Oct 13 '20 at 23:08