Calculating Covariance with Python and Numpy

Question

I am trying to figure out how to calculate covariance with the Python Numpy function cov. When I pass it two one-dimentional arrays, I get back a 2x2 matrix of results. I don't know what to do with that. I'm not great at statistics, but I believe covariance in such a situation should be a single number. This is what I am looking for. I wrote my own:

def cov(a, b):

    if len(a) != len(b):
        return

    a_mean = np.mean(a)
    b_mean = np.mean(b)

    sum = 0

    for i in range(0, len(a)):
        sum += ((a[i] - a_mean) * (b[i] - b_mean))

    return sum/(len(a)-1)

That works, but I figure the Numpy version is much more efficient, if I could figure out how to use it.

Does anybody know how to make the Numpy cov function perform like the one I wrote?

Thanks,

Dave

Minor remark, but you could take advantage of numpy for computing the sum: `sum = ((a - a_mean) * (b - b-mean)).sum()` — PlasmaBinturong, Jan 16 '18 at 14:02
Why do you return `sum/(len(a)-1)`? Shouldn’t it be `sum/(len(a))`? Subtracting 1 from the length is a mistake — Connor, Oct 08 '20 at 16:04
@Connor Covariance of a sample is described by a formula with N-1 in the denominator; for a population, it is described by N in the denominator. So it really depends on what is being calculated. — Goku, Jul 03 '21 at 22:59

unutbu · Accepted Answer · 2013-03-10T01:53:41.160

141

When a and b are 1-dimensional sequences, numpy.cov(a,b)[0][1] is equivalent to your cov(a,b).

The 2x2 array returned by np.cov(a,b) has elements equal to

cov(a,a)  cov(a,b)

cov(a,b)  cov(b,b)

(where, again, cov is the function you defined above.)

edited Mar 10 '13 at 01:53

answered Mar 10 '13 at 01:40

unutbu

842,883
184
1,785
1,677

11

Thank you so much! I wish the documentation had explained it that well. That works perfectly. Once I had my own working function, I should have compared the result to the numpy.cov function and I'd probably have figured that out. I'd vote-up if I could, but I'm new and apparently can't. – Dave Mar 10 '13 at 01:59
4

This is incorrect: the `numpy.cov` defaults to calculating the *sample* covariance. The next answer explains it. In particular this should say `numpy.cov(a,b, bias=True)[0][1]` – WestCoastProjects Nov 22 '16 at 21:43
8

@javadba: Take a look at the OP's code. He is dividing by `(len(a)-1)` so his `cov` function is computing sample covariance. Therefore, my answer is correct.. – unutbu Nov 22 '16 at 23:10
2

ah ok. But you should add a note to that effect in your answer. I looked at your answer in isolation- and doubtful I were the only one to have done so. – WestCoastProjects Nov 22 '16 at 23:18
wouldn't cov(a,a) be... zero? And cov(a,b) = cov(b,a)? What is the point of returning 4 arrays? – john k Nov 05 '17 at 00:27
2

@johnktejik: [`cov(a,a) = var(a)`](https://en.wikipedia.org/wiki/Covariance#Properties). In general, the variance, `var(a)`, is not zero. – unutbu Nov 05 '17 at 00:35

score 36 · Answer 2 · 2021-08-20T21:48:20.137

36

Thanks to unutbu for the explanation. By default numpy.cov calculates the sample covariance. To obtain the population covariance you can specify normalisation by the total N samples like this:

numpy.cov(a, b, bias=True)[0][1]

or like this:

numpy.cov(a, b, ddof=0)[0][1]

edited Aug 20 '21 at 21:48

answered Aug 23 '16 at 10:12

1

**This** is the correct answer, not the accepted one - which omits the `bias=True` – WestCoastProjects Nov 22 '16 at 21:44
1

Turns out the accepted answer does work: but it does not include mention of the bias adjustment - which is buried in the OP. So this answer info is helpful. – WestCoastProjects Nov 22 '16 at 23:20
@ Lerner Zhang. Subtracting n_samples by one provides an unbiased estimator of the population covariance. See en.wikipedia.org/wiki/Bias_of_an_estimator – May 04 '19 at 17:34
@Osian when you say "thanks to ubuntu" if that means you have a reference/source to cite please do so. Otherwise what does it mean? – brethvoice Aug 13 '21 at 20:02
1

@brethvoice I was referencing the accepted answer provided by unutbu (see above). Although the accepted answer is correct, it does not discuss the important distinction between calculating the sample & population covariance. – Aug 20 '21 at 21:45
I did not realize there was a user named @unutbu (I also thought it was Ubuntu). That was the main source of my confusion; I thought you were talking about the Ubuntu user's manual or something: https://wiki.ubuntu.com/ubuntu-manual – brethvoice Aug 23 '21 at 12:19
So, by using numpy.cov, we have four times more calculations than necessary? We calculate cov(a,a), cov(a,b), cov(b,a) and cov(b,b), and we need only cov(a,b). – HD2000 Sep 21 '22 at 19:41

Xavier Guihot · Answer 3 · 2021-10-06T09:38:17.660

6

Note that starting in Python 3.10, one can obtain the covariance directly from the standard library.

Using statistics.covariance which is a measure (the number you're looking for) of the joint variability of two inputs:

from statistics import covariance

# x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
covariance(x, y)
# 0.75

edited Oct 06 '21 at 09:38

answered May 13 '21 at 16:35

Xavier Guihot

54,987
21
291
190

1

Nice avoidance of numpy. – Peter Cotton Jan 07 '22 at 19:47
Hopefully statistics will include online versions too soon. In the meantime, https://github.com/microprediction/precise – Peter Cotton Jan 07 '22 at 19:47

Calculating Covariance with Python and Numpy

3 Answers3

Linked