To speed up some physics simulation code, I am using Pardiso with Eigen. Compared to alternative solvers such as Eigen's SparseLU, Pardiso gives me a significant boost in the factorize step, and that boost grows as I increase the number of threads.
In my simulation code, though, the RHS of this solve is not a single vector but a matrix. I would expect this to parallelize fine, since each back solve is independent, but I see no speedup in the solve step when I increase the thread count.
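For context, the scaling I'm describing was measured roughly like this (a sketch, assuming using namespace Eigen, PardisoLU from <Eigen/PardisoSupport>, omp_get_wtime for timing, and the same A and B as in the test below):

#include <Eigen/PardisoSupport>
#include <mkl.h>
#include <omp.h>

PardisoLU<SparseMatrix<double>> pardiso;
for (int t : {1, 2, 4, 8})
{
    mkl_set_num_threads(t);

    double t0 = omp_get_wtime();
    pardiso.compute(A);                         // analyze + factorize
    double t1 = omp_get_wtime();
    SparseMatrix<double> X = pardiso.solve(B);  // multi-column RHS
    double t2 = omp_get_wtime();

    // the factorize side (t1 - t0) gets faster with more threads;
    // the solve side (t2 - t1) stays essentially flat
}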
Because of this, I tried using OpenMP to back solve in parallel. While apparently faster, this keeps returning an incorrect result. I've set up a test roughly like this...
SparseMatrix<double> A = ...;
SparseMatrix<double> B = ...;
PardisoLU<SparseMatrix<double>> pardiso;
mkl_set_num_threads(t);

// Factorize
pardiso.analyzePattern(A);
pardiso.factorize(A);

// Solve1: one call with the full multi-column RHS
SparseMatrix<double> X1 = pardiso.solve(B);

// Solve2: one back solve per column, distributed across OpenMP threads
SparseMatrix<double> X2(A.rows(), B.cols());
#pragma omp parallel num_threads(t)
{
    #pragma omp for
    for (int n = 0; n < B.cols(); ++n)
    {
        X2.col(n) = pardiso.solve(B.col(n));
    }
}
X2.isApprox(X1); // I know X1 is giving me the correct matrix, but this returns false
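For comparison, running the same per-column loop through Eigen's SparseLU instead of Pardiso does give me the correct result, roughly like this (a sketch, reusing A, B, X1, and t from above):

#include <Eigen/SparseLU>

SparseLU<SparseMatrix<double>> lu;
lu.analyzePattern(A);
lu.factorize(A);

SparseMatrix<double> X3(A.rows(), B.cols());
#pragma omp parallel num_threads(t)
{
    #pragma omp for
    for (int n = 0; n < B.cols(); ++n)
    {
        X3.col(n) = lu.solve(B.col(n));
    }
}
X3.isApprox(X1); // true with SparseLU, even with multiple threads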
I'm curious if anyone knows why Solve1 runs at the same speed regardless of num_threads, or why Solve2 returns the wrong matrix.
I'm on Windows using VS2019, compiling with the Intel C++ compiler, with Use MKL set to Parallel and OpenMP enabled. I've verified that OpenMP is on and working, that Solve2 is correct when the omp for runs on a single thread, and that Solve2 is correct when using OpenMP with Eigen's SparseLU instead of Pardiso (as in the comparison sketch above). I've also looked over these previous questions, to no avail:
Eigen::PardisoLU (MKL) in parallel scenario
Calling multithreaded MKL in from openmp parallel region
Number of threads of Intel MKL functions inside OMP parallel regions
I still end up with an incorrect output. Setting mkl_set_dynamic(1) and omp_set_nested(1) before everything doesn't change anything.
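For completeness, "before everything" means roughly this placement (a sketch):

#include <mkl.h>   // mkl_set_dynamic, mkl_set_num_threads
#include <omp.h>   // omp_set_nested

mkl_set_dynamic(1);      // let MKL adjust its own thread count inside parallel regions
omp_set_nested(1);       // allow nested OpenMP parallelism
mkl_set_num_threads(t);

// ... factorize, Solve1, and Solve2 exactly as above; Solve2 is still wrong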