
I have a sparse matrix (x) and an array (y). I would like to compute the correlation between each column in the matrix and the array. Shown below is a very simple approach which is slow. I was hoping somebody would have a faster/better approach.

import numpy as np
from scipy.sparse import rand as r1
from numpy.random import rand as r2

np.random.seed(1000)

nrow,ncol = 50000,4000
x = r1(nrow, ncol, format='csr', density=.05)
y = (r2(nrow)<=.6).astype(int)

correl = [(n, np.corrcoef(np.asarray(x[:, n].todense()).reshape(-1), y)[0, 1]) for n in range(ncol)]
print(correl[:10])
ironv
1 Answer

Using sparsity you can easily gain a speedup of >50x:

import numpy as np
from scipy.sparse import rand as r1
from numpy.random import rand as r2
from time import time

np.random.seed(1000)

nrow,ncol = 5000,4000
x = r1(nrow, ncol, format='csc', density=.05)
y = (r2(nrow)<=.6).astype(int)

t = []
t.append(time())
correl = [np.corrcoef(np.asarray(x[:,n].todense()).reshape(-1), y)[0,1] for n in range(ncol)]

t.append(time())

yy = y - y.mean()
xm = x.mean(axis=0).A.ravel()
ys = yy / np.sqrt(np.dot(yy, yy))
xs = np.sqrt(np.add.reduceat(x.data**2, x.indptr[:-1]) - nrow*xm*xm)

correl2 = np.add.reduceat(x.data * ys[x.indices], x.indptr[:-1]) / xs

t.append(time())

print('results equal --', np.allclose(correl, correl2))
print('run time (sec) -- OP: {}, new: {}'.format(*np.diff(t)))

Sample output:

results equal -- True
run time (sec) -- OP: 1.38134884834, new: 0.0178880691528

Explanation: To take advantage of sparsity we standardise y, which is dense anyway, and then compute the raw correlation between x and y. Because y is zero-mean at this point, the mean of x drops out of the numerator. It then only remains to divide by the standard deviation of x. Here too we can avoid going through a dense matrix, by computing the raw second moment and subtracting the squared mean.
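The identity being used can be sanity-checked on a small dense toy example (not part of the answer): once y is standardised to zero mean and unit norm, the Pearson correlation reduces to a plain dot product divided by the norm of the centred x column:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(100)  # stands in for one (dense) column of x
b = rng.random(100)  # stands in for y

# standardise b: zero mean, unit norm
bb = b - b.mean()
bs = bb / np.sqrt(np.dot(bb, bb))

# because bs sums to zero, the mean of a drops out of the numerator;
# the denominator is the norm of the centred a, obtained from the
# raw second moment minus n*mean^2 -- no explicit centering of a needed
an = np.sqrt(np.dot(a, a) - len(a) * a.mean() ** 2)
r = np.dot(a, bs) / an

print(np.allclose(r, np.corrcoef(a, b)[0, 1]))  # True
```

This is exactly why the answer never has to densify x: the numerator only touches x's nonzeros (zeros contribute nothing to the dot product), and the denominator only needs sums of x.data and x.data**2.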

Implementation detail: Please note that I have taken the liberty of switching to csc, which is more suitable here (if needed, add x = x.tocsc() at the start). We use np.add.reduceat to perform the sums along the 'ragged' columns in a vectorized fashion. The indices array of the csc representation makes it convenient to select the elements of y corresponding to the nonzero elements of x.
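The reduceat trick is easiest to see in isolation on a tiny csc matrix (toy data, not from the answer): in csc format, data holds the nonzeros column by column and indptr[i]:indptr[i+1] delimits column i's slice, so reducing at indptr[:-1] yields per-column sums without ever densifying:

```python
import numpy as np
from scipy.sparse import csc_matrix

a = csc_matrix(np.array([[1., 0., 2.],
                         [0., 3., 4.],
                         [5., 0., 0.]]))

# a.data    -> [1. 5. 3. 2. 4.]   (nonzeros, column by column)
# a.indptr  -> [0 2 3 5]          (column i occupies data[indptr[i]:indptr[i+1]])
col_sums = np.add.reduceat(a.data, a.indptr[:-1])
print(col_sums)  # [6. 3. 6.]

# caveat: reduceat assumes no empty columns -- for a repeated index
# (indptr[i] == indptr[i+1]) it returns data[indptr[i]] instead of 0
```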

Paul Panzer
  • Cool. This answer also is a very good way of explaining the purpose of the `indices`/`indptr` convention used in both `cs*` data formats – Daniel F Jan 11 '18 at 06:59
  • I tried this on a bag-of-words dataset with 127,000 rows / instances and 241,000 features / columns / independent variables. Most of the correlations were right, but some were double their true value (like -0.12 instead of -0.06) and others were NaN compared to np.corrcoef(X[:,feature], y). Otherwise this is a really fast solution. – John May 16 '19 at 19:52
  • @John I can get similar errors by making the sparse matrix so sparse that some columns are empty. Could you check for this in your data set? – Paul Panzer May 17 '19 at 04:53
  • @Paul You're most certainly correct. The columns were empty causing the nan's. By the way this method works great for screening a large number of features in seconds. – John May 20 '19 at 15:03
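As the comment thread establishes, all-empty columns (where indptr[i] == indptr[i+1]) break this approach: reduceat misbehaves on repeated indices and xs becomes zero, producing NaN or wrong values. One possible guard, sketched here with names of my own choosing, is to detect and drop empty columns up front:

```python
import numpy as np
from scipy.sparse import rand as sprand

np.random.seed(0)
# deliberately very sparse so that some columns end up empty
x = sprand(1000, 500, format='csc', density=0.002)

# nonzero count per column: consecutive differences of indptr
nnz_per_col = np.diff(x.indptr)
empty = nnz_per_col == 0
print(empty.sum(), 'empty columns')

# one safe option: drop empty columns before running the vectorized
# correlation (their correlation with y is undefined anyway)
x_clean = x[:, ~empty]
assert np.all(np.diff(x_clean.indptr) > 0)
```

Alternatively, one could keep the full column set and mask the offending entries of the result to NaN explicitly, matching what np.corrcoef would report for a constant column.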