4

Based on this post, I can get covariance between two vectors using np.cov((x,y), rowvar=0). I have a matrix MxN and a vector Mx1. I want to find the covariance between each column of the matrix and the given vector. I know that I can use for loop to write. I was wondering if I can somehow use np.cov() to get the result directly.

Shannon
  • 985
  • 3
  • 11
  • 25
  • 1
    FYI: Unfortunately, the `y` argument of [`numpy.cov`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html) doesn't do what one would hope. This is discussed in a [numpy issue on github](https://github.com/numpy/numpy/issues/2732). – Warren Weckesser Jan 05 '18 at 03:48

1 Answers1

7

As Warren Weckesser said, the numpy.cov(X, Y) is a poor fit for the job because it will simply join the arrays in one M by (N+1) array and find the huge (N+1) by (N+1) covariance matrix. But we'll always have the definition of covariance and it's easy to use:

A = np.sqrt(np.arange(12).reshape(3, 4))   # some 3 by 4 array 
b = np.array([[2], [4], [5]])              # some 3 by 1 vector
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0]-1)

This returns the covariances of each column of A with b.

array([[ 2.21895142,  1.53934466,  1.3379221 ,  1.20866607]])

The formula I used is for sample covariance (which is what numpy.cov computes, too), hence the division by (b.shape[0] - 1). If you divide by b.shape[0] you get the unadjusted population covariance.

For comparison, the same computation using np.cov:

import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]

Same output, but it takes about twice this long (and for large matrices, the difference will be much larger). The slicing at the end is because np.cov computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you wanted. The rest is covariance of A with itself, or of b with itself.

Correlation coefficient

The correlation coefficientis obtained by dividing by square roots of variances. Watch out for that -1 adjustment mentioned earlier: numpy.var does not make it by default, to make it happen you need ddof=1 parameter.

corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1)) 

Check that the output is the same as the less efficient version

np.corrcoef(A, b, rowvar=False)[-1, :-1]
  • That is what I wanted! Thanks a lot for your help and clear explanations. Can you please tell me about the axis? Does axis=0 mean columns? I was reading that correlation is easier to compare than cov. Now that you calculated it based on definition, I think I can just divide "cov" by square root of each of the variances. I want to know the usage of axis, so I can use for that purpose as well. – Shannon Jan 05 '18 at 20:39
  • 0 = row index, 1= column index. Saying axis=0 in `mean` means compute the mean running over the row index. The other index stays constant. So it ends up being the mean of each column. Try small examples like `np.array([[1,2], [4, 7]])` to see how it works. –  Jan 05 '18 at 21:07
  • And I expanded the answer to address the correlation part, because it's a bit tricky. –  Jan 05 '18 at 21:12
  • Thanks a million! It is clear now. I also tried an example with hand to clearly understands the index meaning. I got it, thanks. – Shannon Jan 05 '18 at 21:22
  • The answer was really great. I only have one improvement. The ```numpy``` package has already provided ```np.std()``` function to calculate standard deviation. Therefore the calculation for corr can be simplified as ```corr = cov / (np.std(b, ddof=1) * np.std(A, axis=0, ddof=1))```. The answer also has a deep understanding of sample and population statistics! – Fei Yao Jun 20 '19 at 14:49
  • BTW, I found that `b` should precede `A` otherwise `np.dot()` won't work fine. I guess it is because `numpy` broadcast rules but it is a little bit confusing for me. – Fei Yao Jun 20 '19 at 15:06