
When using EmpiricalCovariance to estimate a covariance matrix for high-dimensional data, I would expect the diagonal of this matrix (from the top-left to the bottom-right) to be all ones, since a variable always correlates perfectly with itself. However, this is not the case. Why not?

Here is an example, plotted with a seaborn heatmap:

[Figure: covariance matrix plotted as a heatmap. The diagonal from the top-left to the bottom-right is lighter than most of the rest of the data, but not the lightest points.]

As you can see, the diagonal is lighter than most of the data, but it's not as light as the lightest point.
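For reference, a minimal sketch of how I produce the matrix (random data standing in for my real dataset; the shapes are illustrative):

from sklearn.covariance import EmpiricalCovariance
import numpy as np

np.random.seed(0)
data = np.random.rand(100, 10)  # rows are samples, columns are variables
cov = EmpiricalCovariance().fit(data).covariance_
print(np.diag(cov))  # the diagonal entries are not all ones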

Ian

2 Answers


If you look at the implementation of the EmpiricalCovariance class and the utility function it invokes, you will see that np.cov(data.T, bias=1) is (almost) the same as EmpiricalCovariance().fit(data).covariance_.

Let's do an experiment:

from sklearn.covariance import EmpiricalCovariance
import numpy as np

np.random.seed(10)
data = np.random.rand(10, 10)  # rows are samples, columns are variables
# sklearn treats rows as samples, while np.cov treats rows as variables,
# hence the transpose; bias=1 matches sklearn's denominator of n.
np.allclose(EmpiricalCovariance().fit(data).covariance_, np.cov(data.T, bias=1))
# True

From NumPy's official docs you can see that the diagonal elements of the covariance matrix are the variances (here, row variances, since np.cov treats rows as variables):

np.isclose(np.var(data[0]), np.cov(data, bias=1)[0][0])  # variance of row 0
# True
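
If you want ones on the diagonal, what you need is the correlation matrix rather than the covariance matrix. A minimal sketch, continuing with data from above (normalizing the covariance by the standard deviations is equivalent to np.corrcoef):

cov = np.cov(data, bias=1)
std = np.sqrt(np.diag(cov))      # per-variable standard deviations
corr = cov / np.outer(std, std)  # rescale covariance into correlation
np.allclose(corr, np.corrcoef(data))
# True
np.allclose(np.diag(corr), 1.0)
# True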
bubble

See this related SO thread.

In summary: what you see on the diagonal are the variances, not correlations.
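
A quick way to see this (a sketch using sklearn's StandardScaler, which is my own addition, not from the linked thread): if you standardize each variable to unit variance first, the empirical covariance of the standardized data is the correlation matrix, so its diagonal is all ones:

from sklearn.covariance import EmpiricalCovariance
from sklearn.preprocessing import StandardScaler
import numpy as np

np.random.seed(10)
data = np.random.rand(10, 10)
scaled = StandardScaler().fit_transform(data)  # zero mean, unit variance per column
cov = EmpiricalCovariance().fit(scaled).covariance_
np.allclose(np.diag(cov), 1.0)
# True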

jsga