
System Specification:

  1. Intel Xeon E7 v3 processor (4 sockets, 16 cores/socket, 2 threads/core)
  2. Eigen library and C++

The following is the serial implementation of the code snippet:

Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {

    Eigen::VectorXd row(nCols);
    for (int k = 0; k < nCols; ++k) {
        row(k) = get_Matrix_Entry(j, k + nColStart);
    }
    return row;
}

double get_Matrix_Entry(int x, int y) {
    return exp(-(x - y) * (x - y));
}

I need to parallelise the get_Row part, as nCols can be as large as 10^6, so I have tried the following techniques:

  1. Naive parallelisation:

    Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
        Eigen::VectorXd row(nCols);

        #pragma omp parallel for schedule(static,8)
        for (int k = 0; k < nCols; ++k) {
            row(k) = get_Matrix_Entry(j, k + nColStart);
        }

        return row;
    }
    
  2. Strip Mining:

    Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
        int vec_len = 8;
        Eigen::VectorXd row(nCols);
        int cols = nCols;
        int rem = cols % vec_len;
        if (rem != 0)
            cols -= rem;

        #pragma omp parallel for
        for (int ii = 0; ii < cols; ii += vec_len) {
            for (int i = ii; i < ii + vec_len; i++) {
                row(i) = get_Matrix_Entry(j, i + nColStart);
            }
        }

        // remainder handled serially
        for (int jj = cols; jj < nCols; jj++)
            row(jj) = get_Matrix_Entry(j, jj + nColStart);

        return row;
    }
    
  3. A technique found online to avoid false sharing (see also the simpler sketch after this list):

    Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
        int cache_line_size = 8;
        Eigen::MatrixXd row_m(nCols, cache_line_size);

        #pragma omp parallel for schedule(static,1)
        for (int k = 0; k < nCols; ++k)
            row_m(k,0) = get_Matrix_Entry(j, k + nColStart);

        Eigen::VectorXd row(nCols);
        row = row_m.block(0, 0, nCols, 1);

        return row;
    }
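
For reference, a simpler variant that, as I understand it, sidesteps false sharing just by relying on OpenMP's default static schedule, so that each thread writes one large contiguous chunk of the vector. The name get_Row_simple is only illustrative; this is a sketch, not one of the techniques I timed above:

Eigen::VectorXd get_Row_simple(const int j, const int nColStart, const int nCols) {
    Eigen::VectorXd row(nCols);

    // Default static schedule: each thread gets one big contiguous range,
    // so threads can only share a cache line at chunk boundaries.
    #pragma omp parallel for
    for (int k = 0; k < nCols; ++k)
        row(k) = get_Matrix_Entry(j, k + nColStart);

    return row;
}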
    

OUTPUT:

None of the above techniques reduced the time taken to execute get_Row for large nCols; the naive parallelisation performed about the same as the other techniques (although all were better than the serial version). Are there any suggestions or methods that can help improve the time?

As requested by user Avi Ginsburg, here are some additional system details:

  • Compiler: g++ (GCC) version 4.4.7
  • Eigen library version: 3.3.2
  • Compiler flags used: "-c -fopenmp -Wall -march=native -O3 -funroll-all-loops -ffast-math -ffinite-math-only -I header", where header is the folder containing Eigen.
  • Output of gcc -march=native -Q --help=target (only some of the flags are shown; a small probe to double-check what actually gets enabled is sketched after this list):

    -mavx                [enabled]
    -mfancy-math-387     [enabled]
    -mfma                [disabled]
    -march=              core2
For a full description of the flags please see this.
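
To double-check which instruction sets the compiler really enables for these flags, a small probe like the one below can be built with the same command line. The __AVX__/__FMA__ macros are, to my understanding, the ones GCC defines when -mavx/-mfma are in effect; treat this as a sketch only:

#include <cstdio>

int main() {
    // GCC defines __AVX__ / __FMA__ only when -mavx / -mfma are in effect
    // (directly or via -march=...), so this reports what the flags enable.
#ifdef __AVX__
    std::puts("AVX enabled (__AVX__ defined)");
#else
    std::puts("AVX not enabled");
#endif
#ifdef __FMA__
    std::puts("FMA enabled (__FMA__ defined)");
#else
    std::puts("FMA not enabled");
#endif
    return 0;
}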

user7440094
  • How did you measure the time? Did you profile your code to identify that this specific part is the hot spot? And if so, how much of the sequential wall time does it take? And what makes you believe that false sharing has anything to do with the issue at hand? – Gilles Jan 25 '17 at 08:31
  • On E7, affinity setting will be particularly important, so as to pin threads to distinct cores and avoid hopping among CPUs. I trust you don't insist on a compiler library lacking this facility. – tim18 Jan 25 '17 at 15:41
  • @tim18 I have set the environment variable proc_affinity to true; it is still not helping. – user7440094 Jan 26 '17 at 10:29
  • @Gilles For time measurement I have used omp_get_wtime() and profiling is done through VTune. The serial code for N = 6553600 runs for 20 seconds, and the parallel version runs for 18 to 22 seconds. This part of the code gets called many times (approx. 30), so this function needs to be optimised, but as you can see it's not. – user7440094 Jan 26 '17 at 10:33
  • @user7440094 How are you compiling? Using a single thread on an i7 with `-march=native -O3` and N = 6553600 takes ~0.15 sec per function call. Removing the mentioned flags results in ~0.5 sec per function call. Using my answer below with OMP, I get ~0.015 sec per function call. – Avi Ginsburg Jan 26 '17 at 15:41
  • If your OpenMP supports omp_places=cores or a numeric equivalent, that may help (see the placement-check sketch after these comments). – tim18 Jan 26 '17 at 17:51
  • @AviGinsburg I'm using the "-c -pg -fopenmp -Wall -O3 -funroll-loops" flags; the serial code takes 0.20 seconds per call, but cumulatively it takes 20 seconds as it's called many times (80-90 times). The parallel version I mentioned also takes nearly the same time. – user7440094 Jan 27 '17 at 07:44
  • @user7440094 Can you try the serial version I wrote in my [answer](http://stackoverflow.com/a/41847107/2899559) with `-march=native`, enabling AVX/fma? – Avi Ginsburg Jan 27 '17 at 07:49
  • @user7440094 Why do you use the `-pg` flag? It inserts additional code to write profile data and negatively affects performance. Did you try without that flag? – Ilya Popov Jan 27 '17 at 23:27
  • @IlyaPopov I used it for profiling with gprof. Yes, you are right, it would slow down the multithreaded version; thanks for pointing it out! – user7440094 Jan 28 '17 at 07:55
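
Regarding the affinity suggestions above, here is a minimal sketch of how one might verify where the OpenMP threads actually land, e.g. when running with OMP_PLACES=cores and OMP_PROC_BIND=close. It assumes Linux/glibc for sched_getcpu and is only illustrative:

#include <omp.h>
#include <sched.h>   // sched_getcpu (glibc extension)
#include <cstdio>

int main() {
    // Run e.g. as: OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
    #pragma omp parallel
    {
        // Each thread reports the CPU it is currently executing on;
        // with pinning in effect the mapping should be stable and distinct.
        std::printf("thread %d of %d on CPU %d\n",
                    omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}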

1 Answer


Try rewriting your function as a single Eigen expression and let Eigen do the vectorization itself, i.e.:

Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {

    Eigen::VectorXd row(nCols);

    row = (-( Eigen::VectorXd::LinSpaced(nCols, nColStart, nColStart + nCols - 1).array()
                      - double(j)).square()).exp().matrix();

    return row;
}

Make sure to use -mavx and -mfma (or -march=native) when compiling. This gives me a ~4x speedup on an i7 (I know you are talking about trying to use 64/128 threads, but this is with a single thread).
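
If you want to confirm that those flags actually reach Eigen, something like the following can be printed once at startup. Eigen::SimdInstructionSetsInUse() is, to my knowledge, the helper Eigen provides for this; treat the snippet as a sketch:

#include <Eigen/Core>
#include <iostream>

int main() {
    // Prints the SIMD instruction sets Eigen was compiled to use;
    // without -mavx (or an AVX-capable -march) the AVX entry will be missing.
    std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
    return 0;
}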

You can enable openmp for some further speedup by dividing the computation into segments:

Eigen::VectorXd get_Row_omp(const int j, const int nColStart, const int nCols) {

    Eigen::VectorXd row(nCols);

#pragma omp parallel
    {
        int num_threads = omp_get_num_threads();
        int tid = omp_get_thread_num();
        int n_per_thread = nCols / num_threads;
        if ((n_per_thread * num_threads < nCols)) n_per_thread++;
        int start = tid * n_per_thread;
        int len = n_per_thread;
        if (tid + 1 == num_threads) len = nCols - start;

        if(start < nCols)
            row.segment(start, len) = (-(Eigen::VectorXd::LinSpaced(len,
                               nColStart + start, nColStart + start + len - 1)
                            .array() - double(j)).square()).exp().matrix();

    }
    return row;

}

For me (4 cores), I get an additional ~3.3x speedup when computing 10^8 elements, but I expect this to be lower for 10^6 elements and/or 64/128 cores (normalizing for the number of cores, of course).

Edit

I hadn't placed any checks to make sure that the OMP threads didn't go out of bounds, and I had mixed up the second and third arguments to Eigen::VectorXd::LinSpaced in the serial version. That probably accounted for any errors you had. Additionally, I've pasted the code that I used for testing here. I compiled with g++ -std=c++11 -fopenmp -march=native -O3; adapt to your needs.

#include <Eigen/Core>
#include <cmath>     // exp() used in get_Matrix_Entry
#include <iostream>
#include <omp.h>


double get_Matrix_Entry(int x, int y) {
        return exp(-(x - y)*(x - y));
}

Eigen::VectorXd get_RowOld(const int j, const int nColStart, const int nCols) {

        Eigen::VectorXd row(nCols);
        for (int k = 0; k<nCols; ++k) {
                row(k) = get_Matrix_Entry(j, k + nColStart);
        }
        return row;
}


Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {

        Eigen::VectorXd row(nCols);

        row = (-( Eigen::VectorXd::LinSpaced(nCols, nColStart, nColStart + nCols - 1).array() - double(j)).square()).exp().matrix();

        return row;
}

Eigen::VectorXd get_Row_omp(const int j, const int nColStart, const int nCols) {

        Eigen::VectorXd row(nCols);

#pragma omp parallel
        {
                int num_threads = omp_get_num_threads();
                int tid = omp_get_thread_num();
                int n_per_thread = nCols / num_threads;
                if ((n_per_thread * num_threads < nCols)) n_per_thread++;
                int start = tid * n_per_thread;
                int len = n_per_thread;
                if (tid + 1 == num_threads) len = nCols - start;


#pragma omp critical
{
        std::cout << tid << "/" << num_threads << "\t" << n_per_thread << "\t" << start <<
                                                         "\t" << len << "\t" << start+len << "\n\n";
}

                if(start < nCols)
                        row.segment(start, len) = (-(Eigen::VectorXd::LinSpaced(len, nColStart + start, nColStart + start + len - 1).array() - double(j)).square()).exp().matrix();

        }
        return row;
}

int main()
{
        std::cout << EIGEN_WORLD_VERSION << '.' << EIGEN_MAJOR_VERSION << '.' << EIGEN_MINOR_VERSION << '\n';
        volatile int b = 3;
        int sz = 6553600;
        sz = 16;
        b = 6553500;
        b = 3;
        {
                auto beg = omp_get_wtime();
                auto r = get_RowOld(5, b, sz);
                auto end = omp_get_wtime();
                auto diff = end - beg;
                std::cout << r.rows() << "\t" << r.cols() << "\n";
//              std::cout << r.transpose() << "\n";
                std::cout << "Old: " << r.mean() << "\n" << diff << "\n\n";

                beg = omp_get_wtime();
                auto r2 = get_Row(5, b, sz);
                end = omp_get_wtime();
                diff = end - beg;
                std::cout << r2.rows() << "\t" << r2.cols() << "\n";
//              std::cout << r2.transpose() << "\n";
                std::cout << "Eigen:         " << (r2-r).cwiseAbs().sum() << "\t" << (r-r2).cwiseAbs().mean() << "\n" << diff << "\n\n";

                auto omp_beg = omp_get_wtime();
                auto r3 = get_Row_omp(5, b, sz);
                auto omp_end = omp_get_wtime();
                auto omp_diff = omp_end - omp_beg;
                std::cout << r3.rows() << "\t" << r3.cols() << "\n";
//              std::cout << r3.transpose() << "\n";
                std::cout << "OMP and Eigen: " << (r3-r).cwiseAbs().sum() << "\t" << (r - r3).cwiseAbs().mean() << "\n" << omp_diff << "\n";
        }

        return 0;

}
Avi Ginsburg
  • I tried it on the Xeon processor but it still gives only a 1.3x speedup for N = 10^6 elements; I think overhead is playing a significant role here. – user7440094 Jan 28 '17 at 07:56
  • @user7440094 Agreed. But it really doesn't make much sense; I'm getting the expected speedups, so unless something else is throttling your system, I would recommend a full [mcve] so that comparisons can be made. – Avi Ginsburg Jan 29 '17 at 09:59
  • You are right, the speedup comes out to 4x; I forgot to remove the -pg flag, thanks! Also, the compilation flag used is "-march=native"; I was unable to use -mavx and -mfma, which gave errors regarding the Eigen library. – user7440094 Jan 30 '17 at 08:57
  • Make sure you are using Eigen 3.3 or greater. – Avi Ginsburg Jan 30 '17 at 08:58
  • Eigen library used is : 3.3.1 – user7440094 Jan 30 '17 at 09:58
  • I used the newer version; it gives me a good speedup over the serial code, but the time taken is nearly the same as my earlier parallelisation technique 1. – user7440094 Jan 31 '17 at 10:22
  • Did you try the parallel version? – Avi Ginsburg Jan 31 '17 at 10:28
  • Yes, I tried that. I think your code and the naive parallelization (without specifying a chunk size) take the same time on a Xeon processor, although compared to the serial code the time is reduced. Any idea why this is happening? Also, you mentioned "let Eigen vectorize itself"; how is that happening? – user7440094 Jan 31 '17 at 11:26
  • Eigen's vectorization of `sin`, `cos`, `exp`, and `log` is copied from [here](http://gruntthepeon.free.fr/ssemath/). Also, what errors did `-march=native` give you with Eigen? – Avi Ginsburg Jan 31 '17 at 12:24
  • I have used "-c -fopenmp -Wall -march=native -O3 -funroll-all-loops -ffast-math -ffinite-math-only -I header" during compilation, where header is the folder that contains Eigen. What do you advise to use as compilation flags? – user7440094 Feb 01 '17 at 05:52
  • When I used -mavx and -mfma together, the file "header/Eigen/src/Core/arch/AVX/MathFunctions.h" had a problem: error: ‘_mm256_fmadd_ps’ was not declared in this scope – user7440094 Feb 01 '17 at 05:57
  • If you can, please update your answer with: OS; compiler and version; compiler flags (and the output from `gcc -march=native -Q --help=target ...` or equivalent); the error messages you get (copy-pasted from the output). All this so maybe someone can reproduce the errors you're getting in order to help you. I don't get any error when compiling with AVX/native, so I don't know what problem you're having. – Avi Ginsburg Feb 01 '17 at 06:03
  • Yes, I have edited the question; the mfma flag is disabled. How do I enable it? – user7440094 Feb 01 '17 at 06:57
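
For what it's worth, one way to check whether the CPU itself offers AVX/FMA, independently of what -march=native resolves to, is GCC's __builtin_cpu_supports (available in newer GCC releases, 4.8 and later as far as I recall). A minimal sketch:

#include <cstdio>

int main() {
    // Initialise the CPU feature detection used by the builtins below.
    __builtin_cpu_init();
    // These query the running CPU, not the compile-time -march setting.
    std::printf("avx supported: %s\n", __builtin_cpu_supports("avx") ? "yes" : "no");
    std::printf("fma supported: %s\n", __builtin_cpu_supports("fma") ? "yes" : "no");
    return 0;
}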