
In order to speed up some physics simulation code, I am using Eigen's PardisoLU (MKL) wrapper. Compared to alternative solvers such as Eigen's SparseLU, Pardiso gives me a significant boost in the factorize step, and that boost grows as I increase the number of threads.
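For context, here is a minimal sketch of how the factorize step is set up and timed. The includes are the real ones, but the helper name and timing code are just illustrative, not my exact benchmark:

#include <Eigen/Sparse>
#include <Eigen/PardisoSupport>   // Eigen::PardisoLU (needs MKL)
#include <mkl.h>                  // mkl_set_num_threads
#include <chrono>
#include <iostream>

using SpMat = Eigen::SparseMatrix<double>;

// Hypothetical helper: time the factorize step with t MKL threads.
// (Eigen's SparseLU is the baseline I compare against.)
void timeFactorize(const SpMat& A, int t)
{
    mkl_set_num_threads(t);

    Eigen::PardisoLU<SpMat> pardiso;
    pardiso.analyzePattern(A);

    auto t0 = std::chrono::steady_clock::now();
    pardiso.factorize(A);   // this is the step that scales well as t grows
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "factorize, " << t << " threads: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}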

In my simulation code, though, the RHS of this solve is not a single vector but a matrix. I would expect this to be fine from a parallel perspective, because each back solve is independent, but I see no speedup in the solve step when I increase the thread count.

Because of this, I tried using OpenMP to run the back solves in parallel myself. While this appears to be faster, it keeps returning an incorrect result. I've set up a test roughly like this:

SparseMatrix<double> A = ...;
SparseMatrix<double> B = ...;
mkl_set_num_threads(t);

// Factorize
pardiso.factorize(A);

// Solve1: one call with the full multi-column RHS
SparseMatrix<double> X1 = pardiso.solve(B);

// Solve2: one back solve per column, spread across OpenMP threads
SparseMatrix<double> X2(A.rows(), B.cols());
#pragma omp parallel num_threads(t)
{
#pragma omp for
    for (int n = 0; n < B.cols(); ++n)
    {
        X2.col(n) = pardiso.solve(B.col(n));
    }
}

X2.isApprox(X1); // I know X1 is the correct matrix, but this returns false

I'm curious whether anyone knows why Solve1 runs at the same speed regardless of num_threads, or why Solve2 returns the wrong matrix.

I'm on Windows using VS2019, compiling with the Intel C++ compiler, with Use MKL set to Parallel and OpenMP enabled. I've verified that OpenMP is on and working, that Solve2 is correct when the omp for runs on a single thread, and that Solve2 is correct when I use OpenMP with Eigen's SparseLU instead of Pardiso. I've also looked over all of these previous questions, to no avail:

Eigen::PardisoLU (MKL) in parallel scenario

Calling multithreaded MKL in from openmp parallel region

Number of threads of Intel MKL functions inside OMP parallel regions

Following their suggestions, I still end up with an incorrect output. Calling mkl_set_dynamic(1) and omp_set_nested(1) before everything doesn't change anything; a rough sketch of that attempt is below.
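For completeness, this is roughly how that last attempt looks. It is a sketch, not my exact code: the function name is hypothetical, and the dense result matrix here is only to keep the example self-contained; the settings calls are the ones mentioned above.

#include <Eigen/Sparse>
#include <Eigen/PardisoSupport>   // Eigen::PardisoLU (needs MKL)
#include <mkl.h>                  // mkl_set_dynamic, mkl_set_num_threads
#include <omp.h>                  // omp_set_nested

using SpMat = Eigen::SparseMatrix<double>;

// Hypothetical helper reproducing the Solve2 test with the suggested settings.
Eigen::MatrixXd solveColumnsInParallel(const SpMat& A, const Eigen::MatrixXd& B, int t)
{
    // Settings suggested in the linked questions; they make no difference for me.
    mkl_set_dynamic(1);
    omp_set_nested(1);
    mkl_set_num_threads(t);

    Eigen::PardisoLU<SpMat> pardiso;
    pardiso.compute(A);

    Eigen::MatrixXd X(A.rows(), B.cols());
    #pragma omp parallel for num_threads(t)
    for (int n = 0; n < B.cols(); ++n)
    {
        // Each thread back solves one column. With Pardiso this still gives a
        // wrong X whenever t > 1; swapping in Eigen::SparseLU for the same loop
        // gives the correct result.
        X.col(n) = pardiso.solve(B.col(n));
    }
    return X;
}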

  • Remember that creating threads has a cost of its own. You need enough work for the thread to justify the overhead of thread creation. Also remember that increasing the number of threads beyond the available hardware resources will just slow things down. – Jesper Juhl Mar 27 '20 at 20:15
  • What is the size of the B matrix? I suspect your code is bounded by the memory hierarchy. Can you provide a [minimal working example](https://stackoverflow.com/help/minimal-reproducible-example)? – Jérôme Richard Mar 28 '20 at 12:33

0 Answers