
I am attempting to parallelize the loop below with an OpenMP reduction:

#define EIGEN_DONT_PARALLELIZE
#include <iostream>
#include <cmath>
#include <string>
#include <eigen3/Eigen/Dense>
#include <eigen3/Eigen/Eigenvalues>
#include <omp.h>

using namespace Eigen;
using namespace std;

VectorXd integrand(double E)
{
    VectorXd answer(500000);
    double f = 5.*E + 32.*E*E*E*E;
    for (int j = 0; j != 50; j++)
        answer[j] = j*f;
    return answer;
}

int main()
{
    omp_set_num_threads(4);
    double start = 0.;
    double end = 1.;
    int n = 100;
    double h = (end - start)/(2.*n);

    VectorXd result(500000);
    result.fill(0.);
    double E = start;
    result = integrand(E);
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int j = 1; j <= n; j++){
            E = start + (2*j - 1.)*h;
            result = result + 4.*integrand(E);
            if (j != n){
                E = start + 2*j*h;
                result = result + 2.*integrand(E);
            }
        }
    }
    for (int i = 0; i < 50; ++i)
        cout << i+1 << " , " << result[i] << endl;

    return 0;
}

This is definitely faster in parallel than in serial, but with all 4 threads running the results are hugely variable. When the number of threads is set to 1, the output is correct. I would be most grateful if someone could assist me with this...

I am using the clang compiler with the following compile command:

clang++-3.8 energy_integration.cpp -fopenmp=libiomp5

If this is a bust, then I'll have to learn to implement Boost::thread or std::thread...
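In case it helps frame the question, here is a rough, untested sketch of what I imagine the std::thread route would look like, reusing integrand() from above (integrate_threads is just a hypothetical name; this would need -std=c++11 and -pthread):

#include <thread>
#include <vector>

Eigen::VectorXd integrate_threads(int n, double start, double h, int numThreads)
{
    // One partial sum per thread, so no locking is needed inside the loops
    std::vector<Eigen::VectorXd> partial(numThreads, Eigen::VectorXd::Zero(500000));
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t) {
        pool.emplace_back([&, t] {
            // Static partition: thread t handles j = t+1, t+1+numThreads, ...
            for (int j = t + 1; j <= n; j += numThreads) {
                double E = start + (2*j - 1.)*h;
                partial[t] += 4.*integrand(E);
                if (j != n) {
                    E = start + 2*j*h;
                    partial[t] += 2.*integrand(E);
                }
            }
        });
    }
    for (auto& th : pool) th.join();

    Eigen::VectorXd result = integrand(start);  // the j = 0 endpoint term
    for (const auto& p : partial)               // serial merge of the partial sums
        result += p;
    return result;
}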

AlexD
  • Add `firstprivate(params) reduction(+:result_int)` to your `parallel` directive, remove the `critical` and try again... – Gilles Nov 09 '16 at 04:51
@Gilles thank you for your response. I have edited my code so that the first `#pragma` statement reads `#pragma omp parallel firstprivate(params) reduction(+:result_int)`, the second `#pragma` statement remains as is, and all subsequent `#pragma` statements are removed. The program then yields the runtime error: `....const Eigen::Matrix >]: Assertion aLhs.rows() == aRhs.rows() && aLhs.cols() == aRhs.cols()' failed. Aborted` - I can assure you that both kspace and result_int have the same number of elements and dimensionality – AlexD Nov 09 '16 at 19:34
    Can you round out your example to a full [mcve]? Also, does the serial version work as expected? – Avi Ginsburg Nov 10 '16 at 08:15
  • @AviGinsburg thank you, please see above, now edited. Yes, the serial version works as expected – AlexD Nov 14 '16 at 12:13
This is not Complete. There is no definition of `integrand(...)`, nor is your code compilable. You also have not answered whether the serial version returns the correct result. – Avi Ginsburg Nov 14 '16 at 12:29
@AviGinsburg this is now amended; the code above is a very stripped-down version of what I am trying to accomplish, but I believe the answer will help my purposes – AlexD Nov 14 '16 at 15:47

1 Answer


Your code does not define a custom reduction for OpenMP to reduce the Eigen objects. I'm not sure whether clang supports user-defined reductions (see the OpenMP 4 spec, page 180). If it does, you can declare a reduction and add reduction(+:result) to the #pragma omp for line. If not, you can do it yourself by changing your code as follows:

VectorXd result(500000); // This is the final result, not used by the threads
result.fill(0.);
double E = start;
result = integrand(E);
#pragma omp parallel
{
    // This is a private copy per thread. This resolves race conditions between threads
    VectorXd resultPrivate(500000);
    resultPrivate.fill(0.);
#pragma omp for nowait // reduction(+:result) would go here if user-defined reductions were supported
    for (int j = 1; j <= n; j++) {
        // E is declared inside the loop so each thread gets its own copy;
        // writing to the shared E from main would be another race condition
        double E = start + (2 * j - 1.)*h;
        resultPrivate = resultPrivate + 4.*integrand(E);
        if (j != n) {
            E = start + 2 * j*h;
            resultPrivate = resultPrivate + 2.*integrand(E);
        }
    }
#pragma omp critical
    {
        // Here we sum the results of each thread one at a time
        result += resultPrivate;
    }
}
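Note that the critical section executes only once per thread (not once per loop iteration), and the nowait clause lets each thread enter it as soon as it finishes its share of the iterations, so the serialization at the end is cheap.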

The error you're getting (in your comment) seems to be due to a size mismatch. While there isn't a trivial one in your code itself, don't forget that when OpenMP starts each thread, it has to initialize a private VectorXd per thread. If none is supplied, the default would be VectorXd() (with a size of zero). When this object is then used, the size mismatch occurs. A "correct" usage of omp declare reduction would include the initializer part:

#pragma omp declare reduction (+: VectorXd: omp_out=omp_out+omp_in)\
     initializer(omp_priv=VectorXd::Zero(omp_orig.size()))

omp_priv is the name of the private variable. It gets initialized by VectorXd::Zero(...). The size is specified using omp_orig. The standard (page 182, lines 25-27) defines this as:

The special identifier omp_orig can also appear in the initializer-clause and it will refer to the storage of the original variable to be reduced.

In our case (see full example below), this is result. So result.size() is 500000 and the private variable is initialized to the correct size.

#include <iostream>
#include <string>
#include <Eigen/Core>
#include <omp.h>

using namespace Eigen;
using namespace std;

VectorXd integrand(double E)
{
    VectorXd answer(500000);
    double f = 5.*E + 32.*E*E*E*E;
    for (int j = 0; j != 50; j++)   answer[j] = j*f;
    return answer;
}

#pragma omp declare reduction (+: Eigen::VectorXd: omp_out=omp_out+omp_in)\
     initializer(omp_priv=VectorXd::Zero(omp_orig.size()))

int main()
{
    omp_set_num_threads(4);
    double start = 0.;
    double end = 1.;
    int n = 100;
    double h = (end - start) / (2.*n);

    VectorXd result(500000);
    result.fill(0.);
    double E = start;
    result = integrand(E);

#pragma omp parallel for reduction(+:result)
    for (int j = 1; j <= n; j++) {
        // Loop-local E is automatically private to each thread
        double E = start + (2 * j - 1.)*h;
        result += (4.*integrand(E)).eval();
        if (j != n) {
            E = start + 2 * j*h;
            result += (2.*integrand(E)).eval();
        }
    }
    for (int i = 0; i < 50; ++i)
        cout << i + 1 << " , " << result[i] << endl;

    return 0;
}
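A side note on compilers: omp declare reduction is an OpenMP 4.0 feature, so the reduction version needs GCC 4.9 or later (clang's support was still at OpenMP 3.1 at the time). Assuming the file name from your question and Eigen's headers on the include path, the compile line would be something like:

g++ -fopenmp energy_integration.cpp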
Avi Ginsburg
Excellent, thanks. This gets me a 2.17x speed increase with 4 threads going. The user-defined reduction didn't work for me, but the run-time error makes me wonder whether it is to do with Eigen rather than Clang. EDIT: just tried this with g++, which doesn't even compile: `‘result’ has invalid type for ‘reduction’` – AlexD Nov 14 '16 at 21:10
  • What runtime error occurred? With what code? It may also be that Eigen types don't play nicely with omp, I haven't tried. – Avi Ginsburg Nov 15 '16 at 12:42
I think you're right about Eigen and omp. I have tested this on the code I posted with your user-defined reduction. The runtime error is the same even if only one thread is set; excerpt shown, as the full output is too long: `[BinaryOp = Eigen::internal::scalar_sum_op, Lhs = const Eigen::Matrix, Rhs = const Eigen::CwiseUnaryOp, const Eigen::Matrix >]: Assertion \`aLhs.rows() == aRhs.rows() && aLhs.cols() == aRhs.cols()' failed. Aborted` I hasten to add, the alternative method you outline works a treat. – AlexD Nov 15 '16 at 23:13
Thanks for the addendum; I'll do some further reading on OpenMP from the manual you link. I have tried the code you posted with `omp declare reduction`, but I get a compile error. Clang yields `error: expected an OpenMP directive #pragma omp declare reduction(+: VectorXd: omp_out=omp_out+omp_in)\ ` and g++ yields `In function ‘int main()’: energy_integral_test.cpp:36:45: error: ‘result’ has invalid type for ‘reduction’ #pragma omp parallel for reduction(+:result)`. You have found a solution that works for me, though. – AlexD Nov 16 '16 at 20:02
  • Did you add the initializer part? The '\' was just to keep the line from being too long. – Avi Ginsburg Nov 16 '16 at 20:06
  • `#pragma omp declare reduction (+: Eigen::VectorXd: omp_out=omp_out+omp_in) initializer(omp_priv=VectorXd::Zero(omp_orig.size()))` yields the same error (this is all as one line) – AlexD Nov 16 '16 at 20:22
  • That looks like it's from an old version of gcc (<4.9) before OpenMP4.0 was supported. It works for me with gcc 5.X both on Ubuntu and Windows (mingw). – Avi Ginsburg Nov 16 '16 at 20:43
Yes, I'm using gcc 4.8. Clang only supports OpenMP 3.1 at this stage, but I believe OpenMP 4.0 and 4.5 are being implemented in future builds – AlexD Nov 16 '16 at 20:49