
I have a 24000 * 316 numpy matrix where each row represents a time series with 316 time points, and I am computing the Pearson correlation between each pair of these time series. As a result I would have a 24000 * 24000 numpy matrix of Pearson values. My problem is that this takes a very long time. I have tested my pipeline on smaller matrices (200 * 200) and it works, though it is still slow. I am wondering whether it is expected to be this slow (it takes more than a day!) and what I might be able to do about it. If it helps, this is my code... nothing special or hard:

import numpy
import scipy.stats

def SimMat(mat, name):

    mrange = mat.shape[0]   # number of time series (rows)
    print "mrange:", mrange
    nTRs = mat.shape[1]     # number of time points per series
    print "nTRs:", nTRs

    SimM = numpy.zeros((mrange, mrange))
    for i in range(mrange):
        SimM[i][i] = 1

    for i in range(mrange):
        for j in range(i + 1, mrange):
            pearV = scipy.stats.pearsonr(mat[i], mat[j])

            if pearV[1] <= 0.05:
                if pearV[0] >= 0.5:
                    print "Pearson value:", pearV[0]
                    SimM[i][j] = SimM[j][i] = pearV[0]   # mirror so the matrix stays symmetric
            else:
                SimM[i][j] = SimM[j][i] = 0

    numpy.savetxt(name, SimM)
    return SimM, nTRs

Thanks

mersa
    You should have a look at the indentation in your code sample (and edit your post to fix it!). It isn't right, and it does affect whether your program will run as presented. – DavidW May 07 '15 at 18:59
  • What are you going to do with that 24000*24000 matrix? Perhaps you can work in batches and aggregate the partial results, so you never need the full 24000*24000 matrix in memory or on disk. – Warren Weckesser May 07 '15 at 23:51
  • @DavidW Thank you for pointing that out. The indentation got mangled when I pasted the code here, but the original code is indented properly and runs fine; my issue is performance. – mersa May 08 '15 at 03:06
  • @WarrenWeckesser After obtaining the huge matrix I am going to threshold it with a certain method (FDR), and for that I need all the Pearson values at the same time, so I have to keep them... Maybe I can figure something out... I have to think about it. Thanks – mersa May 08 '15 at 03:08

1 Answer


The main problem with your implementation is the amount of memory you'll need to store the correlation coefficients (at least 4.5 GB). There is no reason to keep the already-computed coefficients in memory. For problems like this, I like to use hdf5 to store the intermediate results, since it works nicely with numpy. Here is a complete, minimal working example:

import numpy as np
import h5py
from scipy.stats import pearsonr

# Create the dataset
h5 = h5py.File("data.h5",'w')
h5["test"] = np.random.random(size=(24000,316))
h5.close()

# Compute the Pearson correlations row by row, writing each finished row to disk
h5 = h5py.File("data.h5",'r+')
A  = h5["test"][:]

N   = A.shape[0]
out = h5.require_dataset("pearson", shape=(N,N), dtype=float)

for i in range(N):
    out[i] = [pearsonr(A[i],A[j])[0] for j in range(N)]

Testing the first 100 rows suggests this will only take about 8 hours on a single core. If you parallelize it, you should get a roughly linear speedup with the number of cores.
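For example, here is a rough sketch of one way to parallelize it (not tested at full scale; it assumes the data.h5 file created above, and the corr_row/_init names are just illustrative). Workers compute whole rows of coefficients and only the parent process writes to the HDF5 dataset, since multiple processes should not write to the same plain HDF5 file concurrently:

import numpy as np
import h5py
from multiprocessing import Pool
from scipy.stats import pearsonr

_A = None  # filled in per worker by the Pool initializer

def _init(A):
    # Give each worker its own reference to the (read-only) data array.
    global _A
    _A = A

def corr_row(i):
    # One full row of Pearson coefficients: series i against every series.
    return i, np.array([pearsonr(_A[i], _A[j])[0] for j in range(_A.shape[0])])

if __name__ == "__main__":
    h5 = h5py.File("data.h5", "r+")
    A = h5["test"][:]                  # ~24000 x 316 floats, roughly 60 MB
    N = A.shape[0]
    out = h5.require_dataset("pearson", shape=(N, N), dtype=float)

    # Workers only compute; the parent process is the sole HDF5 writer.
    pool = Pool(initializer=_init, initargs=(A,))
    for i, row in pool.imap_unordered(corr_row, range(N), chunksize=8):
        out[i] = row
    pool.close()
    pool.join()
    h5.close()

The whole input array is only about 60 MB, so handing a copy to each worker through the initializer is cheap compared with the correlation work itself.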

Hooked
  • Thank you! I believe this will make everything so much faster and is what I needed to know! – mersa May 08 '15 at 03:12