I have been trying different sparse solvers available in Python 3 and comparing the performance between them and also against Octave and Matlab. I have chosen both direct and iterative approaches, I will explain this more in detail below.
To generate a proper sparse matrix, with a banded structure, a Poisson's problem is solved using finite elements with squared grids of N=250, N=500 and N=1000. This results in dimensions of a matrix A=N^2xN^2 and a vector b=N^2x1, i.e., the largest NxN is a million. If one is interested in replicating my results, I have uploaded the matrices A and the vectors b in the following link (it will expire en 30 days) Get systems used here. The matrices are stored in triplets I,J,V, i.e. the first two columns are the indices for the rows and columns, respectively, and the third column are the values corresponding to such indices. Observe that there are some values in V, which are nearly zero, are left on purpose. Still, the banded structure is preserved after a "spy" matrix command in both Matlab and Python.
For comparison, I have used the following solvers:
Matlab and Octave, direct solver: The canonical x=A\b
.
Matlab and Octave, pcg solver: The preconditioned conjugated gradient, pcg solver pcg(A,b,1e-5,size(b,1))
(not preconditioner is used).
Scipy (Python), direct solver: linalg.spsolve(A, b)
where A is previously formatted in csr_matrix
format.
Scipy (Python), pcg solver: sp.linalg.cg(A, b, x0=None, tol=1e-05)
Scipy (Python), UMFPACK solver: spsolve(A, b)
using from scikits.umfpack import spsolve
. This solver is apparently available (only?) under Linux, since it make use of the libsuitesparse [Timothy Davis, Texas A&M]. In ubuntu, this has to first be installed as sudo apt-get install libsuitesparse-dev
.
Furthermore, the aforementioned python solvers are tested in:
- Windows.
- Linux.
- Mac OS.
Conditions:
- Timing is done right before and after the solution of the systems. I.e., the overhead for reading the matrices is not considered.
- Timing is done ten times for each system and an average and a standard deviation is computed.
Hardware:
- Windows and Linux: Dell intel (R) Core(TM) i7-8850H CPU @2.6GHz 2.59GHz, 32 Gb RAM DDR4.
- Mac OS: Macbook Pro retina mid 2014 intel (R) quad-core(TM) i7 2.2GHz 16 Gb Ram DDR3.
Observations:
- Matlab A\b is the fastest despite being in an older computer.
- There are notable differences between Linux and Windows versions. See for instance the direct solver at NxN=1e6. This is despite Linux is running under windows (WSL).
- One can have a huge scatter in Scipy solvers. This is, if the same solution is run several times, one of the times can just increase more than twice.
- The fastest option in python can be nearly four times slower than the Matlab running in a more limited hardware. Really?
If you want to reproduce the tests, I leave here very simple scripts. For matlab/octave:
IJS=load('KbN1M.txt');
b=load('FbN1M.txt');
I=IJS(:,1);
J=IJS(:,2);
S=IJS(:,3);
Neval=10;
tsparse=zeros(Neval,1);
tsolve_direct=zeros(Neval,1);
tsolve_sparse=zeros(Neval,1);
tsolve_pcg=zeros(Neval,1);
for i=1:Neval
tic
A=sparse(I,J,S);
tsparse(i)=toc;
tic
x=A\b;
tsolve_direct(i)=toc;
tic
x2=pcg(A,b,1e-5,size(b,1));
tsolve_pcg(i)=toc;
end
save -ascii octave_n1M_tsparse.txt tsparse
save -ascii octave_n1M_tsolvedirect.txt tsolve_direct
save -ascii octave_n1M_tsolvepcg.txt tsolve_pcg
For python:
import time
from scipy import sparse as sp
from scipy.sparse import linalg
import numpy as np
from scikits.umfpack import spsolve, splu #NEEDS LINUX
b=np.loadtxt('FbN1M.txt')
triplets=np.loadtxt('KbN1M.txt')
I=triplets[:,0]-1
J=triplets[:,1]-1
V=triplets[:,2]
I=I.astype(int)
J=J.astype(int)
NN=int(b.shape[0])
Neval=10
time_sparse=np.zeros((Neval,1))
time_direct=np.zeros((Neval,1))
time_conj=np.zeros((Neval,1))
time_umfpack=np.zeros((Neval,1))
for i in range(Neval):
t = time.time()
A=sp.coo_matrix((V, (I, J)), shape=(NN, NN))
A=sp.csr_matrix(A)
time_sparse[i,0]=time.time()-t
t = time.time()
x=linalg.spsolve(A, b)
time_direct[i,0] = time.time() - t
t = time.time()
x2=sp.linalg.cg(A, b, x0=None, tol=1e-05)
time_conj[i,0] = time.time() - t
t = time.time()
x3 = spsolve(A, b) #ONLY IN LINUX
time_umfpack[i,0] = time.time() - t
np.savetxt('pythonlinux_n1M_tsparse.txt',time_sparse,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvedirect.txt',time_direct,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolvepcg.txt',time_conj,fmt='%.18f')
np.savetxt('pythonlinux_n1M_tsolveumfpack.txt',time_umfpack,fmt='%.18f')
Is there a way to further improve sparse solution times using python? or at least be in a similar order of performance as Matlab? I am open to suggestions using C/C++ or Fortran and a wrapper for python, but I belive it will not get much better than the UMFPACK choice. Suggestions are very welcome.
P.S. I am aware of previous posts, e.g. scipy slow sparse matrix solver Issues using the scipy.sparse.linalg linear system solvers How to use Numba to speed up sparse linear system solvers in Python that are provided in scipy.sparse.linalg? But I think none is as comprehensive as this one, highlighting even more issues between operative systems when using python libraries.
EDIT_1: I add a new plot with results using the QR solver from intel MKL using a python wrapper as suggested in the comments. This is, however, still behind Matlab's performance. To do this, one needs to add:
from sparse_dot_mkl import sparse_qr_solve_mkl
and
sparse_qr_solve_mkl(A.astype(np.float32), b.astype(np.float32))
to the scripts provided in the original post. The ".astype(np.float32)" can be omitted, and the performance gets slighlty worse (about 10 %) for this system.