
Background: I'm doing a project for my Numerical Linear Algebra course. For this project I decided to experiment with doing incomplete Cholesky factorization in half-precision arithmetic and using the result as a preconditioner for iterative methods. I first tried implementing this in MATLAB 2019b (which has a half-precision datatype), but it doesn't support half-precision sparse matrices, so I had to use full matrices. Arithmetic with half precision is a lot slower in MATLAB, and I found that it took about 20 minutes to factor a 500 x 500 matrix (and I want to get up to around 1000 x 1000). In single/double precision, though, a 500 x 500 matrix took less than a second.

I thought I'd have better luck scaling to larger matrices if I could actually take advantage of the sparsity of the matrix. I remembered that numpy/scipy has a float16 datatype, so I decided to try implementing this in Python. Here's what I wrote:

import numpy as np
from scipy.io import loadmat
from tqdm import tqdm

def icholesky(a):
    n = a.shape[0]
    for k in tqdm(list(range(n))):
        a[k,k] = np.sqrt(a[k,k])
        #for i in range(k+1,n):
        #    if (a[i,k] != 0):
        #        a[i,k] = a[i,k]/a[k,k]
        i,_ = a[:,k].nonzero()  # rows of the stored entries in column k
        if len(i) > 0:
            a[i,k] = a[i,k]/a[k,k]
        for j in range(k+1,n):
            #for i in range(j,n):
            #    if (a[i,j] != 0):
            #        a[i,j] = a[i,j]-a[i,k]*a[j,k]
            i,_ = a[j:,j].nonzero()  # stored entries of a[j:,j] (indices relative to the slice)
            if len(i) > 0:
                a[i,j] = a[i,j] - a[i,k]*a[j,k]
    return a

bus = loadmat('494_bus.mat')  # from the University of Florida Sparse Matrix Collection
A = bus['Problem'][0,0][1]
H = A.copy()
icholesky(H)

Here `a` is a scipy sparse matrix in CSC format. (The commented-out code is just the algorithm fully written out, without any attempt to take advantage of the sparsity.) I found that this took about 6 minutes to run, which is much faster than the MATLAB code when I use half-precision floats, but still a lot slower than the MATLAB code when I use single/double-precision floats (which took less than a second), even though MATLAB is using full matrices.
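In case a driver helps reproduce the timing, this is roughly what I'm running (the `perf_counter` wrapping here is just a sketch of how I measured; the load is the same as above):

import time
from scipy.io import loadmat

bus = loadmat('494_bus.mat')
A = bus['Problem'][0,0][1].tocsc()  # make sure the matrix really is CSC
H = A.copy()

t0 = time.perf_counter()
icholesky(H)
print(f"icholesky took {time.perf_counter() - t0:.1f} s")  # ~6 minutes for me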

There's always the possibility that I just made a mistake in my code somewhere and am not actually obtaining the correct run times, so I'll look that over again. But I'm wondering if anyone more accustomed to scipy/numpy sees anything amiss in the way I implemented the code above.

I do have one other theory for why the Python code may be so slow. I'm running this on my school's high-performance computer, and it could be the case that MATLAB is set up to automatically take advantage of the parallelism but Python is not. Does this seem like a reasonable hypothesis? If so, do you have any suggestions for how I could properly parallelize my algorithm?

Onye
  • Please review [MRE](https://stackoverflow.com/help/minimal-reproducible-example) from the intro tour. We need your code for reproducing the problem, as well as the full result. Your posted code takes more work than usual: it doesn't do anything (no driver program), you've used a lot of meaningless variables, and there's no output tracing the results you discuss. – Prune Apr 29 '20 at 18:45
  • How is the performance of the scipy code with single precision (i.e. `np.float32`) instead of half precision? – Warren Weckesser Apr 29 '20 at 18:49
  • Python is much slower than MATLAB when interpreting code, so code with loops tends to run much slower in Python than in MATLAB (see the loop-vs-vectorized sketch after these comments). Also, MATLAB's half-float type is a class, not a built-in type, which is why it's so slow. – Cris Luengo Apr 29 '20 at 18:56
  • @Prune: I've modified my code to include the matrix I ran it on. Which variables are meaningless? I don't see how I could use fewer variables than I did. @WarrenWeckesser: The datatype doesn't seem to affect the runtime from what I can tell. @CrisLuengo: Thanks, that helps. Maybe I can find a better way to iterate through the matrix entries. – Onye Apr 29 '20 at 19:58
  • It is difficult to give definitive answers because of so many variables, but half precision is usually not an option for speeding up your computation. The corresponding instructions were introduced as late as 2012 and offer only minor performance benefits over single precision, especially for smaller data [1]. Probably no one uses such instructions for distributed software, because it would require at least a 3rd-generation Intel CPU. [1]: https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats – Daniel Apr 29 '20 at 20:18
  • @Daniel: Sorry, I think I didn't explain myself clearly. I wasn't hoping the half precision would make the code faster; I was just wondering why the Python code was so much slower than the MATLAB code. Like, it's over a thousand times slower. So I'm looking for advice on best practices for approaching a problem like this, like Cris pointing out that my loops may be causing things to run slowly. If I figure out a way to run the algorithm in a more streamlined fashion, I should be able to make it faster. Mostly, I was just wondering if I was doing something particularly foolish. – Onye Apr 29 '20 at 21:41
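To put a rough number on the interpreted-loop overhead Cris Luengo describes above, here is a toy comparison (a sum of squares computed both ways, not the factorization itself):

import time
import numpy as np

x = np.random.rand(10**6)

# element-by-element loop: every iteration runs through the interpreter
t0 = time.perf_counter()
s = 0.0
for v in x:
    s += v * v
t_loop = time.perf_counter() - t0

# the same reduction as one call into compiled code
t0 = time.perf_counter()
s_vec = float(x @ x)
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f} s, vectorized: {t_vec:.5f} s")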

1 Answer


I figured out a way to alleviate the problem. I don't need `j` to iterate through all the numbers from k+1 to n; I only need to go through the `j` where `a[j,k]` is non-zero, and those are exactly the indices already computed in `i_`.

def icholesky(a):
    n = a.shape[0]
    for k in range(n):
        a[k,k] = np.sqrt(a[k,k])
        i_,_ = a[k+1:,k].nonzero()  # indices come back relative to the slice...
        if len(i_) > 0:
            i_ = i_ + (k+1)         # ...so shift them to absolute row numbers
            a[i_,k] = a[i_,k]/a[k,k]
        for j in i_:                # only columns with a[j,k] != 0 need updating
            i2_,_ = a[j:n,j].nonzero()
            if len(i2_) > 0:
                i2_ = i2_ + j
                a[i2_,j] = a[i2_,j] - a[i2_,k]*a[j,k]
    return a
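As a quick sanity check (a sketch reusing the `494_bus.mat` setup from the question), the incomplete factor should reproduce `A` at `A`'s stored entries, since IC(0) introduces no fill-in:

import numpy as np
from scipy.io import loadmat
from scipy.sparse import tril

bus = loadmat('494_bus.mat')
A = bus['Problem'][0,0][1].tocsc()

L = tril(icholesky(A.copy())).tocsc()  # the factor lives in the lower triangle

# L @ L.T should agree with A on A's nonzero pattern (and only there)
rows, cols = A.nonzero()
residual = abs((L @ L.T - A)[rows, cols]).max()
print(residual)  # should be near machine epsilon for the dtype used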
Onye
  • Is the input `a` a CSC matrix? If so, have you tried using `a.indices` and `a.indptr` to travel through the nonzeros directly? Maybe doing `a.sum_duplicates()` before the `for` loops is a good idea too. I'm going to test this :D – Breno Aug 12 '20 at 20:00
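A rough sketch of the traversal Breno suggests, assuming `a` is CSC with sorted, duplicate-free indices (the helper `column_nonzeros` is a made-up name, not scipy API):

import numpy as np
from scipy.sparse import csc_matrix

def column_nonzeros(a, k, start_row=0):
    """Stored entries of column k of a CSC matrix with row >= start_row.

    Reads a.indptr/a.indices directly, so no sparse slicing or fancy
    indexing is needed; returns (row indices, positions into a.data).
    """
    lo, hi = a.indptr[k], a.indptr[k+1]  # column k occupies a.data[lo:hi]
    rows = a.indices[lo:hi]
    keep = rows >= start_row
    return rows[keep], np.arange(lo, hi)[keep]

# Example: divide the subdiagonal part of column k by the pivot, in place.
a = csc_matrix(np.array([[4., 0., 2.],
                         [0., 9., 0.],
                         [2., 0., 8.]]))
k = 0
rows, pos = column_nonzeros(a, k, start_row=k+1)
a.data[pos] /= a[k, k]  # writing to a.data bypasses __setitem__ entirely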