76

I am trying to figure out how to calculate covariance with the Python Numpy function cov. When I pass it two one-dimentional arrays, I get back a 2x2 matrix of results. I don't know what to do with that. I'm not great at statistics, but I believe covariance in such a situation should be a single number. This is what I am looking for. I wrote my own:

def cov(a, b):

    if len(a) != len(b):
        return

    a_mean = np.mean(a)
    b_mean = np.mean(b)

    sum = 0

    for i in range(0, len(a)):
        sum += ((a[i] - a_mean) * (b[i] - b_mean))

    return sum/(len(a)-1)

That works, but I figure the Numpy version is much more efficient, if I could figure out how to use it.

Does anybody know how to make the Numpy cov function perform like the one I wrote?

Thanks,

Dave

Dave
  • 1,326
  • 2
  • 11
  • 22
  • 1
    Minor remark, but you could take advantage of numpy for computing the sum: `sum = ((a - a_mean) * (b - b-mean)).sum()` – PlasmaBinturong Jan 16 '18 at 14:02
  • Why do you return `sum/(len(a)-1)`? Shouldn’t it be `sum/(len(a))`? Subtracting 1 from the length is a mistake – Connor Oct 08 '20 at 16:04
  • 2
    @Connor Covariance of a sample is described by a formula with N-1 in the denominator; for a population, it is described by N in the denominator. So it really depends on what is being calculated. – Goku Jul 03 '21 at 22:59
  • Yep you are correct – Connor Jul 03 '21 at 23:14

3 Answers3

141

When a and b are 1-dimensional sequences, numpy.cov(a,b)[0][1] is equivalent to your cov(a,b).

The 2x2 array returned by np.cov(a,b) has elements equal to

cov(a,a)  cov(a,b)

cov(a,b)  cov(b,b)

(where, again, cov is the function you defined above.)

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • 11
    Thank you so much! I wish the documentation had explained it that well. That works perfectly. Once I had my own working function, I should have compared the result to the numpy.cov function and I'd probably have figured that out. I'd vote-up if I could, but I'm new and apparently can't. – Dave Mar 10 '13 at 01:59
  • 4
    This is incorrect: the `numpy.cov` defaults to calculating the *sample* covariance. The next answer explains it. In particular this should say `numpy.cov(a,b, bias=True)[0][1]` – WestCoastProjects Nov 22 '16 at 21:43
  • 8
    @javadba: Take a look at the OP's code. He is dividing by `(len(a)-1)` so his `cov` function is computing sample covariance. Therefore, my answer is correct.. – unutbu Nov 22 '16 at 23:10
  • 2
    ah ok. But you should add a note to that effect in your answer. I looked at your answer in isolation- and doubtful I were the only one to have done so. – WestCoastProjects Nov 22 '16 at 23:18
  • wouldn't cov(a,a) be... zero? And cov(a,b) = cov(b,a)? What is the point of returning 4 arrays? – john k Nov 05 '17 at 00:27
  • 2
    @johnktejik: [`cov(a,a) = var(a)`](https://en.wikipedia.org/wiki/Covariance#Properties). In general, the variance, `var(a)`, is not zero. – unutbu Nov 05 '17 at 00:35
36

Thanks to unutbu for the explanation. By default numpy.cov calculates the sample covariance. To obtain the population covariance you can specify normalisation by the total N samples like this:

numpy.cov(a, b, bias=True)[0][1]

or like this:

numpy.cov(a, b, ddof=0)[0][1]
  • 1
    **This** is the correct answer, not the accepted one - which omits the `bias=True` – WestCoastProjects Nov 22 '16 at 21:44
  • 1
    Turns out the accepted answer does work: but it does not include mention of the bias adjustment - which is buried in the OP. So this answer info is helpful. – WestCoastProjects Nov 22 '16 at 23:20
  • @ Lerner Zhang. Subtracting n_samples by one provides an unbiased estimator of the population covariance. See en.wikipedia.org/wiki/Bias_of_an_estimator –  May 04 '19 at 17:34
  • @Osian when you say "thanks to ubuntu" if that means you have a reference/source to cite please do so. Otherwise what does it mean? – brethvoice Aug 13 '21 at 20:02
  • 1
    @brethvoice I was referencing the accepted answer provided by unutbu (see above). Although the accepted answer is correct, it does not discuss the important distinction between calculating the sample & population covariance. –  Aug 20 '21 at 21:45
  • I did not realize there was a user named @unutbu (I also thought it was Ubuntu). That was the main source of my confusion; I thought you were talking about the Ubuntu user's manual or something: https://wiki.ubuntu.com/ubuntu-manual – brethvoice Aug 23 '21 at 12:19
  • So, by using numpy.cov, we have four times more calculations than necessary? We calculate cov(a,a), cov(a,b), cov(b,a) and cov(b,b), and we need only cov(a,b). – HD2000 Sep 21 '22 at 19:41
6

Note that starting in Python 3.10, one can obtain the covariance directly from the standard library.

Using statistics.covariance which is a measure (the number you're looking for) of the joint variability of two inputs:

from statistics import covariance

# x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
covariance(x, y)
# 0.75
Xavier Guihot
  • 54,987
  • 21
  • 291
  • 190