System Specification:
- Intel Xeon E7-v3 processor (4 sockets, 16 cores/socket, 2 threads/core)
- C++ with the Eigen library
The following is the serial implementation:
Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
    Eigen::VectorXd row(nCols);
    for (int k = 0; k < nCols; ++k) {
        row(k) = get_Matrix_Entry(j, k + nColStart);
    }
    return row;
}
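double get_Matrix_Entry(int x, int y) {
    // Subtract in double: (x-y)*(x-y) overflows int once |x-y| exceeds 46340.
    const double d = static_cast<double>(x) - static_cast<double>(y);
    return exp(-d * d);
}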
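A side observation on this particular entry function: in double precision, exp(-d*d) underflows to exactly zero at a distance d of about 27 or 28 (depending on whether subnormals are flushed), so most of a long row is zero. The following is only a sketch under that assumption, using std::vector in place of Eigen::VectorXd so it compiles without Eigen; the names entry, zero_cutoff, get_row_full and get_row_banded are hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Matrix entry from the question, with the difference taken in double.
double entry(int x, int y) {
    const double d = static_cast<double>(x) - static_cast<double>(y);
    return std::exp(-d * d);
}

// Smallest distance d at which exp(-d*d) evaluates to exactly 0.0.
// Computed at run time instead of hard-coded, since it depends on the
// floating-point environment.
int zero_cutoff() {
    int d = 0;
    while (std::exp(-static_cast<double>(d) * d) != 0.0) ++d;
    return d;
}

// Reference version: evaluate every entry of the row.
std::vector<double> get_row_full(int j, int nColStart, int nCols) {
    assert(nCols >= 0);
    std::vector<double> row(nCols);
    for (int k = 0; k < nCols; ++k) row[k] = entry(j, k + nColStart);
    return row;
}

// Sketch: evaluate only columns c with |j - c| < cutoff; all other
// entries keep their initial value 0.0.
std::vector<double> get_row_banded(int j, int nColStart, int nCols) {
    assert(nCols >= 0);
    std::vector<double> row(nCols, 0.0);
    const long cutoff = zero_cutoff();
    // Local index range [lo, hi) whose columns lie inside the band.
    const long lo = std::max<long>(0L, static_cast<long>(j) - cutoff + 1 - nColStart);
    const long hi = std::min<long>(static_cast<long>(nCols),
                                   static_cast<long>(j) + cutoff - nColStart);
    for (long k = lo; k < hi; ++k) {
        row[k] = entry(j, static_cast<int>(k + nColStart));
    }
    return row;
}
```

If this applies, the work per row no longer grows with nCols apart from zero-initialising the vector. It is specific to this entry function and would not carry over to a kernel that decays more slowly.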
I need to parallelise get_Row because nCols can be as large as 10^6, so I tried the following techniques:
Naive parallelisation:
Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
    Eigen::VectorXd row(nCols);
    #pragma omp parallel for schedule(static,8)
    for (int k = 0; k < nCols; ++k) {
        row(k) = get_Matrix_Entry(j, k + nColStart);
    }
    return row;
}
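For comparison, here is a minimal sketch of the same loop with schedule(static) and no chunk size, which gives each thread a single contiguous block of iterations, so neighbouring threads can only share a cache line at block boundaries. It uses std::vector in place of Eigen::VectorXd so it compiles without Eigen; entry, get_row_serial and get_row_parallel are hypothetical names. Without -fopenmp the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Matrix entry from the question, with the difference taken in double.
double entry(int x, int y) {
    const double d = static_cast<double>(x) - static_cast<double>(y);
    return std::exp(-d * d);
}

// Serial reference.
std::vector<double> get_row_serial(int j, int nColStart, int nCols) {
    assert(nCols >= 0);
    std::vector<double> row(nCols);
    for (int k = 0; k < nCols; ++k) row[k] = entry(j, k + nColStart);
    return row;
}

// Parallel version: each thread writes one contiguous block of the row.
std::vector<double> get_row_parallel(int j, int nColStart, int nCols) {
    assert(nCols >= 0);
    std::vector<double> row(nCols);
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < nCols; ++k) {
        row[k] = entry(j, k + nColStart);
    }
    return row;
}
```

Both functions should return identical vectors, which makes it easy to check correctness before comparing timings.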
Strip Mining:
Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
    const int vec_len = 8;
    Eigen::VectorXd row(nCols);
    // Largest multiple of vec_len that fits in nCols.
    const int cols = nCols - nCols % vec_len;
    #pragma omp parallel for
    for (int ii = 0; ii < cols; ii += vec_len) {
        // i is declared inside the loop so that each thread has its own copy.
        for (int i = ii; i < ii + vec_len; ++i) {
            row(i) = get_Matrix_Entry(j, i + nColStart);
        }
    }
    // Remainder: the last nCols % vec_len entries.
    for (int jj = cols; jj < nCols; ++jj) {
        row(jj) = get_Matrix_Entry(j, jj + nColStart);
    }
    return row;
}
A padding approach I found online to avoid false sharing:
Eigen::VectorXd get_Row(const int j, const int nColStart, const int nCols) {
    const int cache_line_size = 8;  // doubles per 64-byte cache line
    // Only column 0 is used; the other columns are meant as padding.
    Eigen::MatrixXd row_m(nCols, cache_line_size);
    #pragma omp parallel for schedule(static,1)
    for (int k = 0; k < nCols; ++k) {
        row_m(k, 0) = get_Matrix_Entry(j, k + nColStart);
    }
    Eigen::VectorXd row = row_m.block(0, 0, nCols, 1);
    return row;
}
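One detail worth checking with this approach: Eigen::MatrixXd is column-major by default, so the entries (k,0) and (k+1,0) sit next to each other in memory and the extra columns do not separate them. The index arithmetic can be sketched as follows; col_major_offset and row_major_offset are hypothetical helpers written for illustration, not Eigen functions.

```cpp
#include <cassert>
#include <cstddef>

// Element offset of entry (r, c) in a column-major matrix with nRows rows.
std::size_t col_major_offset(std::size_t r, std::size_t c, std::size_t nRows) {
    assert(r < nRows);
    return c * nRows + r;
}

// Element offset of entry (r, c) in a row-major matrix with nCols columns.
std::size_t row_major_offset(std::size_t r, std::size_t c, std::size_t nCols) {
    assert(c < nCols);
    return r * nCols + c;
}
```

For an nCols x 8 matrix, consecutive rows of column 0 are 1 element (8 bytes) apart in column-major storage but 8 elements (64 bytes) apart in row-major storage. Padding of this kind would therefore only take effect with a row-major layout, at the cost of touching eight times as much memory.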
OUTPUT:
None of the above techniques reduced the execution time of get_Row for large nCols beyond what the naive parallelisation achieves: all three perform about the same (although better than the serial version). Are there any suggestions or methods that could improve the time further?
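For reference, the kind of timing comparison described above can be done with a small wall-clock helper. This is a minimal sketch, assuming a C++11 compiler for std::chrono (GCC 4.4.7 does not fully support it; omp_get_wtime() from omp.h is an alternative there). The name best_time_seconds is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Runs f() `repeats` times and returns the shortest wall-clock time in
// seconds. Taking the minimum reduces noise from other processes.
template <typename F>
double best_time_seconds(F f, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        f();
        const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

A call such as best_time_seconds([&]{ row = get_Row(j, 0, nCols); }, 10) for each variant gives numbers that can be compared directly. Wall-clock time matters here because CPU time from clock() adds up the time of all threads.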
Following the comment from user Avi Ginsburg, here are some additional system details:
- Compiler: g++ (GCC) 4.4.7
- Eigen library version: 3.3.2
- Compiler flags used: "-c -fopenmp -Wall -march=native -O3 -funroll-all-loops -ffast-math -ffinite-math-only -I header", where header is the folder containing Eigen.
Output of gcc -march=native -Q --help=target (only some of the flags are shown):
-mavx [enabled]
-mfancy-math-387 [enabled]
-mfma [disabled]
-march= core2
For a full description of the flags, please see this.
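The flag listing can be cross-checked from inside the program, because GCC defines macros such as __AVX__, __FMA__ and _OPENMP when the corresponding features are enabled. The following is a minimal sketch; enabled_features is a hypothetical helper name.

```cpp
#include <string>

// Returns a space-separated list of the features the compiler enabled
// for this translation unit, based on predefined macros.
std::string enabled_features() {
    std::string s = "features:";
#ifdef _OPENMP
    s += " openmp";
#endif
#ifdef __SSE2__
    s += " sse2";
#endif
#ifdef __AVX__
    s += " avx";
#endif
#ifdef __FMA__
    s += " fma";
#endif
    return s;
}
```

Printing the result from a file compiled with the same flags as the rest of the project shows whether OpenMP and AVX are really active. With the flags listed above one would expect openmp and avx but not fma.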