
I am using input data from here (see Section 3.1).

I am trying to reproduce their covariance matrix, eigenvalues, and eigenvectors using scikit-learn. However, I am unable to reproduce the results as presented in the data source. I've also seen this input data elsewhere but I can't discern whether it's a problem with scikit-learn, my steps, or the data source.

import numpy as np
from sklearn.decomposition import PCA

data = np.array([[2.5,2.4],
                 [0.5,0.7],
                 [2.2,2.9],
                 [1.9,2.2],
                 [3.1,3.0],
                 [2.3,2.7],
                 [2.0,1.6],
                 [1.0,1.1],
                 [1.5,1.6],
                 [1.1,0.9],
                 ]) 

centered_data = data-data.mean(axis=0)
pca = PCA()
pca.fit(centered_data)
print(pca.get_covariance()) #Covariance Matrix

array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])

print(pca.explained_variance_ratio_) #Eigenvalues (normalized)

[ 0.96318131  0.03681869]

print(pca.components_) #Eigenvectors

[[-0.6778734  -0.73517866]
 [ 0.73517866 -0.6778734 ]]

Surprisingly, the projections match the results from the data source described above.

print(pca.transform(centered_data)) #Projections

array([[-0.82797019,  0.17511531],
       [ 1.77758033, -0.14285723],
       [-0.99219749, -0.38437499],
       [-0.27421042, -0.13041721],
       [-1.67580142,  0.20949846],
       [-0.9129491 , -0.17528244],
       [ 0.09910944,  0.3498247 ],
       [ 1.14457216, -0.04641726],
       [ 0.43804614, -0.01776463],
       [ 1.22382056,  0.16267529]])

Here is what I don't understand:

  1. Why is the covariance matrix different?
  2. Updated: How do I get eigenvalues from scikit-learn that are not already normalized?
  • Okay, I think I realize that the "explained_variance_ratio_" is NOT the same as the eigenvalues. Instead, they appear to be normalized over the sum of the eigenvalues. So, the "explained_variance_ratio_" are essentially normalized eigenvalues used for scree plots. Though, it's not clear how I can get the eigenvalues using scikit-learn. – slaw Dec 30 '14 at 04:36

2 Answers


Correct covariance matrix of this data:

numpy.cov(data.transpose())
array([[ 0.61655556,  0.61544444],
       [ 0.61544444,  0.71655556]])

Biased (i.e. "incorrect", using the wrong normalization term, and underestimating variance in the data set) covariance matrix:

numpy.cov(data.transpose(), bias=1)
array([[ 0.5549,  0.5539],
       [ 0.5539,  0.6449]])

NumPy knows that the data has to be centered, so you don't need to pass centered_data here.
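
To make the centering explicit, here is a minimal sketch of what numpy.cov computes, assuming the data array from the question is in scope (the variable names are only for illustration):

import numpy

xc = data - data.mean(axis=0)                     # numpy.cov centers like this internally
print(xc.transpose().dot(xc) / (len(data) - 1))   # matches numpy.cov(data.transpose())
print(xc.transpose().dot(xc) / len(data))         # matches numpy.cov(data.transpose(), bias=1)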

What scikit-learn's PCA reports does not map 1:1 to the eigenvalues.

Correct eigenvalue decomposition:

numpy.linalg.eig(numpy.cov(data.transpose()))
(array([ 0.0490834 ,  1.28402771]),
 array([[-0.73517866, -0.6778734 ],
        [ 0.6778734 , -0.73517866]]))

Using the biased estimator yields different Eigenvalues (again, underestimating variance), but the same Eigenvectors:

(array([ 0.04417506,  1.15562494]), ...

Note that the Eigenvectors are not yet sorted by the largest Eigenvalues.
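
If you want them in scikit-learn's order, a small sketch (assuming data and the fitted pca from the question are still in scope) that sorts the eigenpairs by descending eigenvalue:

s, e = numpy.linalg.eig(numpy.cov(data.transpose()))
o = numpy.argsort(s)[::-1]          # largest eigenvalue first
print(s[o])                         # [ 1.28402771  0.0490834 ]
print(e[:, o].transpose())          # rows correspond to pca.components_, up to sign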

As the name of pca.explained_variance_ratio_ indicates, these are not the Eigenvalues. They are the ratio. If we take the (biased, underestimating) eigenvalues, and normalize them to have a sum of 1, we get

s, e = numpy.linalg.eig(numpy.cov(data.transpose(), bias=1))
s / sum(s)
array([ 0.03681869,  0.96318131])
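
As a quick cross-check against the fitted pca from the question (assuming it is still in scope), the same ratio comes out, just in scikit-learn's descending order:

r = s / s.sum()
print(pca.explained_variance_ratio_)    # [ 0.96318131  0.03681869]
print(numpy.sort(r)[::-1])              # same values, sorted descending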

Also, the pca.transform method of scikit-learn apparently does not apply scaling. IMHO, when using PCA, it is also fairly common to scale each component to have unit variance. This obviously does not hold for this output. Then the result would be (with the two columns swapped; I did not bother to change this)

s, e = numpy.linalg.eig(numpy.cov(data.transpose()))
o = numpy.argsort(s)[::-1]                        # sort by descending eigenvalue
(data - data.mean(axis=0)).dot(e[:, o]) / numpy.sqrt(s[o])
array([[-0.73068047, -0.79041795],
       [ 1.56870773,  0.64481466],
       [-0.87561043,  1.73495337],
       [-0.24198963,  0.58866414],
       [-1.47888824, -0.94561319],
       [-0.80567404,  0.79117236],
       [ 0.08746369, -1.57900372],
       [ 1.01008049,  0.20951358],
       [ 0.38657401,  0.08018421],
       [ 1.08001688, -0.73426743]])

(As you can see, PCA is just three lines in numpy, so you don't need a function for this.)

Why do I think this is the proper result? Because the resulting data set has the property that its covariance matrix is (except for rounding errors) the identity matrix. Without scaling, the covariance matrix is numpy.diag(s[o]). But one may also argue that by applying the scaling, I "lost" the variance information that would otherwise have been kept.
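
To verify this, a quick check that reuses s, e and o from the snippet above:

scaled = (data - data.mean(axis=0)).dot(e[:, o]) / numpy.sqrt(s[o])
print(numpy.cov(scaled.transpose()))        # ~ identity matrix
unscaled = (data - data.mean(axis=0)).dot(e[:, o])
print(numpy.cov(unscaled.transpose()))      # ~ numpy.diag(s[o])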

In my opinion, scikit-learn is using the wrong (biased) covariance. numpy is correct.

But more often than not, it does not matter much. In the ratio above, the bias cancels out. And if you have a large data set, the difference between using the naive 1/n and the unbiased 1/(n-1) eventually becomes negligible. Since the unbiased variance estimate comes at effectively zero extra CPU cost, you might as well use it.
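
A short numeric illustration of both points, assuming the data array from the question: the biased eigenvalues are just the unbiased ones times (n-1)/n, so the factor cancels in the ratio and shrinks toward 1 as n grows.

import numpy

s_unb = numpy.sort(numpy.linalg.eigvals(numpy.cov(data.transpose())))
s_bia = numpy.sort(numpy.linalg.eigvals(numpy.cov(data.transpose(), bias=1)))
print(s_bia / s_unb)            # constant factor (n-1)/n = 0.9 for n = 10
print(s_unb / s_unb.sum())      # the ratio is the same either way...
print(s_bia / s_bia.sum())      # ...because the constant factor cancels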

  • The reason why you want to use the "biased" version is that you have lost 1 degree of freedom by subtracting the mean of the data. In other words: Having N data points is N degrees of freedom (each vary independently). Now subtract the mean of all the data points. Imagine then that you somehow only knew N-1 data points. Would you be able to obtain the value of the Nth data point? Sure you would, because you know the mean and you know the N-1 other values. Thus, there are only N-1 degrees of freedom, that is the bias you must account for. – Jesper - jtk.eth May 10 '16 at 19:37
  • @denvar the "degrees of freedom" explanation seems to be *not* widely accepted as correct. Also, you mix up biased and unbiased. The `/N` is *biased* (because it underestimates systematically), the `/(N-1)` is unbiased. – Has QUIT--Anony-Mousse May 10 '16 at 20:13
  • Thanks for pointing out the biased issue. Do you have any reference on the statement that it is not widely accepted? – Jesper - jtk.eth May 10 '16 at 20:20
  • Or it was in the context of *weighted* samples. The degrees-of-freedom concept breaks down, but you still have bias. – Has QUIT--Anony-Mousse May 10 '16 at 21:18
  • @Anony-Mousse why isn't it necessary to center the data before `sklearn.decomposition.PCA` or `np.cov`? Also, I noticed you centered it at the very end with `(data - data.mean(axis=0)).dot(e[:,o]) / numpy.sqrt(s[o])`; why did you center it here and divide by `sqrt(s[o])`? Thanks! – O.rka Jun 19 '16 at 19:54
  • The *definition* of covariance already includes centering. But to *project* using the resulting matrix, you still need to center. – Has QUIT--Anony-Mousse Jun 19 '16 at 21:25

The short answer to (1) is that when you applied PCA to your demeaned data, you rotated it, and the new vector space expresses new random variables with different covariance. The answer to (2) is: if you want the non-normalized eigenvalues, just eigendecompose the covariance matrix of your data.
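
As a small check of both points (assuming pca, centered_data and data from the question are in scope): the covariance in the rotated space is (nearly) diagonal, and the non-normalized eigenvalues come straight from an eigendecomposition of the covariance of the original data.

import numpy

rotated = pca.transform(centered_data)
print(numpy.cov(rotated.transpose()))                      # off-diagonal entries ~ 0
print(numpy.linalg.eigvals(numpy.cov(data.transpose())))   # non-normalized eigenvalues (unsorted)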

More info:

To compute eigenvalues using numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.eigvals.html

You can instead compute the SVD of the data matrix (not covariance) and look at the singular values: http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html
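
For example, a minimal sketch (assuming the data array from the question): the squared singular values of the centered data, divided by n-1, equal the eigenvalues of the covariance matrix, and the right singular vectors are the eigenvectors.

import numpy

xc = data - data.mean(axis=0)
u, sv, vt = numpy.linalg.svd(xc, full_matrices=False)
print(sv ** 2 / (len(data) - 1))    # ~ [ 1.28402771  0.0490834 ]
print(vt)                           # rows are the eigenvectors, up to sign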

Apparently, scikit-learn has different flavors of SVD that you may want to try.

  • Do you happen to know what "explained_variance_" means and how it relates to the eigenvalues? It appears that "explained_variance_" and "explained_variance_ratio_" are related by a normalization constant but the former does not match the eigenvalues. I also can't seem to find ANY other example on the internet for using PCA – slaw Dec 30 '14 at 15:18
  • @slaw please look at [this](http://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained?lq=1) post. – Lord Henry Wotton Dec 30 '14 at 20:20