Based on this post, I can get the covariance between two vectors using `np.cov((x, y), rowvar=0)`. I have an MxN matrix and an Mx1 vector, and I want to find the covariance between each column of the matrix and the given vector. I know I can write a `for` loop to do it; I was wondering whether I can somehow use `np.cov()` to get the result directly.

- FYI: Unfortunately, the `y` argument of [`numpy.cov`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html) doesn't do what one would hope. This is discussed in a [numpy issue on github](https://github.com/numpy/numpy/issues/2732). – Warren Weckesser Jan 05 '18 at 03:48
1 Answer
As Warren Weckesser said, `numpy.cov(X, Y)` is a poor fit for the job because it simply joins the arrays into one M by (N+1) array and computes the huge (N+1) by (N+1) covariance matrix. But we always have the definition of covariance, and it is easy to use:
import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))  # some 3 by 4 array
b = np.array([[2], [4], [5]])             # some 3 by 1 vector

# center both, take the dot product, divide by n - 1 (sample covariance)
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0] - 1)
This returns the covariances of each column of A with b.
array([[ 2.21895142, 1.53934466, 1.3379221 , 1.20866607]])
The formula I used is for the sample covariance (which is what `numpy.cov` computes, too), hence the division by `(b.shape[0] - 1)`. If you divide by `b.shape[0]` instead, you get the unadjusted population covariance.
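As a quick check (my addition, not part of the original answer): `numpy.cov` normalizes by N rather than N-1 when you pass `bias=True`, so the population version of the formula can be verified like this:

import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])

# population covariance: divide by the number of rows, not rows - 1
cov_pop = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / b.shape[0]

# np.cov with bias=True uses the same normalization, so the two should agree
print(np.allclose(cov_pop, np.cov(A, b, rowvar=False, bias=True)[-1, :-1]))  # True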
For comparison, the same computation using `np.cov`:
import numpy as np
A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])
np.cov(A, b, rowvar=False)[-1, :-1]
Same output, but it takes about twice as long (and for large matrices, the difference will be much larger). The slicing at the end is needed because `np.cov` computes a 5 by 5 matrix, in which only the first 4 entries of the last row are what you wanted. The rest is the covariance of A with itself, or of b with itself.
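To make the slicing concrete, here is a small sketch (my addition, not from the original answer) of what the full 5 by 5 matrix contains:

import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])

full = np.cov(A, b, rowvar=False)  # shape (5, 5)
print(full[:4, :4])   # covariances of A's columns with each other
print(full[4, 4])     # variance of b
print(full[-1, :-1])  # the row we actually want: cov of b with each column of A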
Correlation coefficient
The correlation coefficient is obtained by dividing by the square roots of the variances. Watch out for the -1 adjustment mentioned earlier: `numpy.var` does not make it by default; to get it you need the `ddof=1` parameter.
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))
Check that the output is the same as that of the less efficient version:
np.corrcoef(A, b, rowvar=False)[-1, :-1]
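Put together as a runnable check (my own sketch; the final print just confirms the two routes agree):

import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])

cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0] - 1)
corr = cov / np.sqrt(np.var(b, ddof=1) * np.var(A, axis=0, ddof=1))

# the built-in (slower) route should agree to floating-point precision
print(np.allclose(corr, np.corrcoef(A, b, rowvar=False)[-1, :-1]))  # True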
- That is what I wanted! Thanks a lot for your help and the clear explanations. Can you please tell me about the axis? Does axis=0 mean columns? I was reading that correlation is easier to compare than covariance. Now that you calculated it based on the definition, I think I can just divide "cov" by the square root of each of the variances. I want to know the usage of axis, so I can use it for that purpose as well. – Shannon Jan 05 '18 at 20:39
- 0 = row index, 1 = column index. Saying axis=0 in `mean` means compute the mean running over the row index; the other index stays constant. So it ends up being the mean of each column. Try small examples like `np.array([[1, 2], [4, 7]])` to see how it works. – Jan 05 '18 at 21:07
- And I expanded the answer to address the correlation part, because it's a bit tricky. – Jan 05 '18 at 21:12
- Thanks a million! It is clear now. I also tried an example by hand to clearly understand the index meaning. I got it, thanks. – Shannon Jan 05 '18 at 21:22
- The answer was really great. I only have one improvement: the `numpy` package already provides the `np.std()` function to calculate the standard deviation, so the calculation of corr can be simplified to `corr = cov / (np.std(b, ddof=1) * np.std(A, axis=0, ddof=1))`. The answer also shows a deep understanding of sample and population statistics! – Fei Yao Jun 20 '19 at 14:49
- BTW, I found that `b` should precede `A`, otherwise `np.dot()` won't work. I guess it is because of `numpy` broadcasting rules, but it is a little bit confusing to me. – Fei Yao Jun 20 '19 at 15:06
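For what it's worth, a quick shape check (my own illustration, not from the thread) suggests the order requirement comes from matrix-multiplication dimensions rather than broadcasting:

import numpy as np

A = np.sqrt(np.arange(12).reshape(3, 4))
b = np.array([[2], [4], [5]])

print((b.T - b.mean()).shape)      # (1, 3)
print((A - A.mean(axis=0)).shape)  # (3, 4)

# (1, 3) @ (3, 4) -> (1, 4): the inner dimensions match.
# Reversing the order would attempt (3, 4) @ (1, 3), whose inner dimensions
# (4 and 1) do not match, so np.dot raises a ValueError.
cov = np.dot(b.T - b.mean(), A - A.mean(axis=0)) / (b.shape[0] - 1)
print(cov.shape)                   # (1, 4)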