Background: I'm doing a project for my Numerical Linear Algebra course. For this project I decided to experiment with doing incomplete Cholesky factorization in half-precision arithmetic and using the result as a preconditioner for iterative methods. I first tried implementing this in MATLAB 2019b (which has a half-precision datatype), but it doesn't support half-precision sparse matrices, so I had to use full matrices. Half-precision arithmetic is a lot slower in MATLAB, though: factoring a 500 x 500 matrix took around 20 minutes (and I want to get up to around 1000 x 1000), whereas in single/double precision the same 500 x 500 matrix took less than a second.
I thought I'd have better luck scaling to larger matrices if I could actually take advantage of the sparsity of the matrix. I remembered that numpy/scipy has a float16 datatype, so I decided to try and implement this in Python. So I wrote this:
import numpy as np
from scipy.io import loadmat
from tqdm import tqdm

def icholesky(a):
    n = a.shape[0]
    for k in tqdm(list(range(n))):
        a[k,k] = np.sqrt(a[k,k])
        # Dense version of the column scaling:
        # for i in range(k+1,n):
        #     if a[i,k] != 0:
        #         a[i,k] = a[i,k]/a[k,k]
        i,_ = a[:,k].nonzero()
        i = i[i > k]        # only scale entries strictly below the diagonal
        if len(i) > 0:
            a[i,k] = a[i,k]/a[k,k]
        for j in range(k+1,n):
            # Dense version of the trailing update:
            # for i in range(j,n):
            #     if a[i,j] != 0:
            #         a[i,j] = a[i,j]-a[i,k]*a[j,k]
            i,_ = a[j:,j].nonzero()
            i = i + j       # nonzero() indexes into the slice, so shift back to absolute rows
            if len(i) > 0:
                a[i,j] = a[i,j] - a[i,k]*a[j,k]
    return a
bus = loadmat('494_bus.mat')    # from the University of Florida Sparse Matrix Collection
A = bus['Problem'][0,0][1]      # pull the sparse matrix out of the MAT-file struct
H = A.copy()
icholesky(H)
Here 'a' is a scipy sparse matrix in CSC format. (The commented-out code is just the algorithm fully written out, without any attempt to take advantage of sparsity.) I found that this took about 6 minutes to run, which is much faster than the MATLAB code in half precision but still a lot slower than the MATLAB code in single/double precision (which took less than a second), even though MATLAB is using full matrices.
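For completeness, the half-precision setup I have in mind looks roughly like this (a minimal sketch, not my exact script; tocsc()/astype(np.float16) are the standard scipy calls for getting a CSC copy whose data array is float16):

import numpy as np
from scipy.io import loadmat

bus = loadmat('494_bus.mat')
A = bus['Problem'][0,0][1]           # sparse matrix stored inside the MAT-file struct
H = A.tocsc().astype(np.float16)     # CSC copy with the nonzero values cast to float16
print(H.dtype, H.format)             # should report float16 csc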
There's always the possibility that I just made a mistake in my code somewhere and I'm not actually obtaining the correct run times, so I'll look that over again. But I'm wondering if anyone more accustomed to scipy/numpy sees anything amiss about the way I chose to implement the above code.
I do have one other theory for why the Python code may be so slow. I'm running this on my school's high-performance computer, and it could be the case that MATLAB is set up to automatically take advantage of the parallelism but Python is not. Does this seem like a reasonable hypothesis? If so, do you have any suggestions for how I could properly parallelize my algorithm?
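In case it helps, this is how I'd check that hypothesis, by inspecting which BLAS/LAPACK numpy is linked against and how many threads it's using (np.show_config() is built into numpy; threadpoolctl is a separate package, so that part is an assumption on my end):

import numpy as np
np.show_config()    # prints which BLAS/LAPACK numpy was built against

# threadpoolctl (pip install threadpoolctl) reports live thread-pool sizes
from threadpoolctl import threadpool_info
for pool in threadpool_info():
    print(pool['internal_api'], pool['num_threads'])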