Is there a way to further improve sparse solution times using python?

Question

I have been trying different sparse solvers available in Python 3 and comparing the performance between them and also against Octave and Matlab. I have chosen both direct and iterative approaches, I will explain this more in detail below.

To generate a proper sparse matrix, with a banded structure, a Poisson's problem is solved using finite elements with squared grids of N=250, N=500 and N=1000. This results in dimensions of a matrix A=N^2xN^2 and a vector b=N^2x1, i.e., the largest NxN is a million. If one is interested in replicating my results, I have uploaded the matrices A and the vectors b in the following link (it will expire en 30 days) Get systems used here. The matrices are stored in triplets I,J,V, i.e. the first two columns are the indices for the rows and columns, respectively, and the third column are the values corresponding to such indices. Observe that there are some values in V, which are nearly zero, are left on purpose. Still, the banded structure is preserved after a "spy" matrix command in both Matlab and Python.

For comparison, I have used the following solvers:

Matlab and Octave, direct solver: The canonical x=A\b.

Matlab and Octave, pcg solver: The preconditioned conjugated gradient, pcg solver pcg(A,b,1e-5,size(b,1)) (not preconditioner is used).

Scipy (Python), direct solver: linalg.spsolve(A, b) where A is previously formatted in csr_matrix format.

Scipy (Python), pcg solver: sp.linalg.cg(A, b, x0=None, tol=1e-05)

Scipy (Python), UMFPACK solver: spsolve(A, b) using from scikits.umfpack import spsolve. This solver is apparently available (only?) under Linux, since it make use of the libsuitesparse [Timothy Davis, Texas A&M]. In ubuntu, this has to first be installed as sudo apt-get install libsuitesparse-dev.

Furthermore, the aforementioned python solvers are tested in:

Windows.
Linux.
Mac OS.

Conditions:

Timing is done right before and after the solution of the systems. I.e., the overhead for reading the matrices is not considered.
Timing is done ten times for each system and an average and a standard deviation is computed.

Hardware:

Windows and Linux: Dell intel (R) Core(TM) i7-8850H CPU @2.6GHz 2.59GHz, 32 Gb RAM DDR4.
Mac OS: Macbook Pro retina mid 2014 intel (R) quad-core(TM) i7 2.2GHz 16 Gb Ram DDR3.

Results:

Observations:

Matlab A\b is the fastest despite being in an older computer.
There are notable differences between Linux and Windows versions. See for instance the direct solver at NxN=1e6. This is despite Linux is running under windows (WSL).
One can have a huge scatter in Scipy solvers. This is, if the same solution is run several times, one of the times can just increase more than twice.
The fastest option in python can be nearly four times slower than the Matlab running in a more limited hardware. Really?

If you want to reproduce the tests, I leave here very simple scripts. For matlab/octave:

IJS=load('KbN1M.txt');
b=load('FbN1M.txt');

I=IJS(:,1);
J=IJS(:,2);
S=IJS(:,3);

Neval=10;
tsparse=zeros(Neval,1);
tsolve_direct=zeros(Neval,1);
tsolve_sparse=zeros(Neval,1);
tsolve_pcg=zeros(Neval,1);
for i=1:Neval
    tic
    A=sparse(I,J,S);
    tsparse(i)=toc;
    tic
    x=A\b;
    tsolve_direct(i)=toc;        
    tic
    x2=pcg(A,b,1e-5,size(b,1));
    tsolve_pcg(i)=toc;
end

save -ascii octave_n1M_tsparse.txt tsparse
save -ascii octave_n1M_tsolvedirect.txt tsolve_direct
save -ascii octave_n1M_tsolvepcg.txt tsolve_pcg

For python:

import time
from scipy import sparse as sp
from scipy.sparse import linalg
import numpy as np
from scikits.umfpack import spsolve, splu #NEEDS LINUX


b=np.loadtxt('FbN1M.txt')
triplets=np.loadtxt('KbN1M.txt')

I=triplets[:,0]-1
J=triplets[:,1]-1
V=triplets[:,2]

I=I.astype(int)
J=J.astype(int)
NN=int(b.shape[0])

Neval=10
time_sparse=np.zeros((Neval,1))
time_direct=np.zeros((Neval,1))
time_conj=np.zeros((Neval,1))
time_umfpack=np.zeros((Neval,1))
for i in range(Neval):
    t = time.time()
    A=sp.coo_matrix((V, (I, J)), shape=(NN, NN))
    A=sp.csr_matrix(A)
    time_sparse[i,0]=time.time()-t
    t = time.time()
    x=linalg.spsolve(A, b)
    time_direct[i,0] = time.time() - t
    t = time.time()
    x2=sp.linalg.cg(A, b, x0=None, tol=1e-05)
    time_conj[i,0] = time.time() - t
    t = time.time()
    x3 = spsolve(A, b) #ONLY IN LINUX
    time_umfpack[i,0] = time.time() - t

np.savetxt('pythonlinux_n1M_tsparse.txt',time_sparse,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvedirect.txt',time_direct,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvepcg.txt',time_conj,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolveumfpack.txt',time_umfpack,fmt='%.18f')

Is there a way to further improve sparse solution times using python? or at least be in a similar order of performance as Matlab? I am open to suggestions using C/C++ or Fortran and a wrapper for python, but I belive it will not get much better than the UMFPACK choice. Suggestions are very welcome.

P.S. I am aware of previous posts, e.g. scipy slow sparse matrix solver Issues using the scipy.sparse.linalg linear system solvers How to use Numba to speed up sparse linear system solvers in Python that are provided in scipy.sparse.linalg? But I think none is as comprehensive as this one, highlighting even more issues between operative systems when using python libraries.

EDIT_1: I add a new plot with results using the QR solver from intel MKL using a python wrapper as suggested in the comments. This is, however, still behind Matlab's performance. To do this, one needs to add:

from sparse_dot_mkl import sparse_qr_solve_mkl

and

sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))

to the scripts provided in the original post. The ".astype(np.float32)" can be omitted, and the performance gets slighlty worse (about 10 %) for this system.

Try the mkl solvers. I use the mkl QR solver a lot for direct solves because it's substantially faster than the scipy direct solvers. FGMRES is a good iterative solver for some things as well but I've found it to be kinda sensitive to my shitty initial approx. — CJR, Oct 17 '20 at 17:40
Thanks for your reply. I am aware of intel MKL routines, but I ignore if there are wrappers/libraries to use such routines in python. Is this latter the case you are referring to? If so, I would appreciate if you could refer to some documentation. In any case, some MKL routines use BLAS, LAPACK, and alike routines in the background, isn't it? It is also probably that one can achieve a better optimization levels using such routines intel compilers. Anyways, using any of these tools via Python would be more than desirable. — uom0, Oct 17 '20 at 18:51
I have my own wrapper for FGMRES - honestly they're pretty easy to call from inside python because you can get pointers from numpy arrays really easily. Google says that there's a wrapper for the QR solver though: https://pypi.org/project/sparse-dot-mkl/ — CJR, Oct 17 '20 at 19:38
Thanks for your reply. I have tried the QR solver using the wrapper you suggested. I have added a plot to my original post. It is, however, still a bit behind the Matlab's performance, very similar to UFMPACK. I am very curious to test this problem in something like Petsc. By the way, are you aware of a better conjugate gradient solver than the one in Scipy? In my experience, in Matlab using an incomplete cholesky decomposition together with a preconditioning in PCG can be very useful for very large systems. Faster than direct solvers. Looking forward for something like this in scipy. Cheers!. — uom0, Oct 17 '20 at 21:30
I'd probably use the CG solver in mkl. There's a LU preconditioner as well. Just in general I tend to use mkl for sparse solvers instead of scipy. I don't know of any packages that act as wrappers for the iterative solvers though. — CJR, Oct 17 '20 at 22:40

uom0 · Accepted Answer · 2020-10-20T12:56:17.907

I will try to answer to myself. To provide an answer, I tried an even more demanding example, with a matrix of size of (N,N) of about half a million by half a million and the corresponding vector (N,1). This, however, is much less sparse (more dense) than the one provided in the question. This matrix stored in ascii is of about 1.7 Gb, compared to the one of the example, which is of about 0.25 Gb (despite its "size" is larger). See its shape here,

Then, I tried to solve Ax=b using again Matlab, Octave and Python using the aforementioned the direct solvers from scipy, the intel MKL wrapper, the UMFPACK from Tim Davis. My first surprise is that both Matlab and Octave could solve the systems using the A\b, which is not for certain that it is a direct solver, since it chooses the best solver based on the characteristics of the matrix, see Matlab's x=A\b. However, the python's linalg.spsolve , the MKL wrapper and the UMFPACK were throwing out-of-memory errors in Windows and Linux. In mac, the linalg.spsolve was somehow computing a solution, and alghouth it was with a very poor performance, it never through memory errors. I wonder if the memory is handled differently depending on the OS. To me, it seems that mac swapped memory to the hard drive rather than using it from the RAM. The performance of the CG solver in Python was rather poor, compared to the matlab. However, to improve the performance in the CG solver in python, one can get a huge improvement in performance if A=0.5(A+A') is computed first (if one obviously, have a symmetric system). Using a preconditioner in Python did not help. I tried using the sp.linalg.spilu method together with sp.linalg.LinearOperator to compute a preconditioner, but the performance was rather poor. In matlab, one can use the incomplete Cholesky decomposition.

For the out-of-memory problem the solution was to use an LU decomposition and solve two nested systems, such as Ax=b, A=LL', y=L\b and x=y\L'.

I put here the min. solution times,

Matlab mac, A\b = 294 s.
Matlab mac, PCG (without conditioner)= 17.9 s.
Matlab mac, PCG (with incomplete Cholesky conditioner) = 9.8 s.
Scipy mac, direct = 4797 s.
Octave, A\b = 302 s.
Octave, PCG (without conditioner)= 28.6 s.
Octave, PCG (with incomplete Cholesky conditioner) = 11.4 s.
Scipy, PCG (without A=0.5(A+A'))= 119 s.
Scipy, PCG (with A=0.5(A+A'))= 12.7 s.
Scipy, LU decomposition using UMFPACK (Linux) = 3.7 s total.

So the answer is YES, there are ways to improve the solution times in scipy. The use of the wrappers for UMFPACK (Linux) or intel MKL QR solver is highly recommended, if the memmory of the workstation allows it. Otherwise, performing A=0.5(A+A') prior to using the conjugate gradient solver can have a positive effect in the solution performance if one is dealing with symmetric systems. Let me know if someone would be interested in having this new system, so I can upload it.

Is there a way to further improve sparse solution times using python?

1 Answers1