
I am trying to implement PCA using numpy.linalg.eig with two different methods (one using the covariance matrix, one using the method from eigenfaces) and I am comparing my results with the PCA from sklearn. But I observe that my results are different, so I am wondering what mistake I am making. I have 3 samples, and each sample has 4 features. I am then trying to reduce the dimension of the samples to 3.

EDIT: added the SVD method. The results I get using the covariance PCA, the SVD and the PCA from sklearn are pretty close. But with the "eigenface" method it is totally different. Why?

import numpy as np
from sklearn.decomposition import PCA
from sklearn.utils.extmath import svd_flip  # used by the SVD variant below

x = np.array([[0.5, 0.8, 1.5, -2.4], [-1.9, -8.7, 0.02, 4.9], [5.5, 6.1, -8.1, 3.0]])
print(x)
K = 3

# -- sklearn -- #
pca = PCA(n_components=K).fit(x)
res = pca.transform(x)
print('sklearn :', res)

# -- numpy covariance -- #
X = x - np.mean(x, axis = 0)  

cov = np.cov(X.T)
print("covariance :", cov.shape)

evals , evecs = np.linalg.eig(cov)

idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]

res2 = X.dot(evecs[:,:K]) 
print("numpy with cov:", res2)

# -- numpy scatter matrix -- #
X = x - np.mean(x, axis = 0)
C = np.dot(X, X.T)
evals , evecs = np.linalg.eig(C)
idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]

v = np.dot(evecs, X)
print("v :", v.shape)
res3= X[:, :K].dot(v)
print('numpy with scatter matrix : ', res3)

# -- numpy svd -- #
X = x - np.mean(x, axis = 0)  
U, S, V = np.linalg.svd(X, full_matrices=False)
U, V = svd_flip(U, V)
res2 = X.dot(V.T) 
print("numpy with svd:", res2)
Shiro

2 Answers


First of all: what does any of this mean? You have three points in a 4-dimensional space. They span a 2-dimensional plane. PCA finds a basis for this plane, and the coefficients of your points in that basis. Using Matlab's [C, S] = pca(x) for comparison, we get

C =
    0.4028    0.1082
    0.7895   -0.3198
   -0.4560   -0.5881
   -0.0806    0.7349

and

S =
   -0.5865   -5.8249
   -8.9674    3.1891
    9.5539    2.6357

which are matrices with the property that S*C' recovers the centered data (which is X in your notation). The columns of C are basis vectors for the 2D subspace, and the rows of S are the coordinates of your three points in that subspace.
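(Side note, not part of the original answer: the same reconstruction property can be checked in numpy against the sklearn fit, reusing the x and pca objects defined in the question.)

# Sanity check: the scores times the component matrix recover the centered data,
# the numpy analogue of Matlab's S*C'.
X_centered = x - np.mean(x, axis=0)
reconstructed = pca.transform(x).dot(pca.components_)
print(np.allclose(reconstructed, X_centered))  # True, up to round-off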

Sklearn returns

[[ -5.86525831e-01   5.82485371e+00  -2.65147201e-16]
 [ -8.96738194e+00  -3.18911605e+00   1.41061647e-16]
 [  9.55390777e+00  -2.63573766e+00  -5.28988843e-16]]

where the third column is noise (essentially zeros), reflecting that the points lie in a 2D plane; there is no 3rd principal component to be found. The first two columns match S from Matlab, except for sign choices.
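(Another quick check, not in the original answer: sklearn's explained variance ratios, read off the pca object from the question, show the same thing.)

print(pca.explained_variance_ratio_)  # the third entry is essentially zero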

Your computation of "NumPy with cov" does the same as sklearn, except the random noise in the 3rd column is different. By the way, for this computation you don't need to center the data; cov does it on its own. cov = np.cov(x.T) would work just as well.

[[ -5.86525831e-01  -5.82485371e+00   5.26721273e-16]
 [ -8.96738194e+00   3.18911605e+00   3.83725073e-15]
 [  9.55390777e+00   2.63573766e+00  -3.35763132e-15]]  

"Eigenface" approach

The main idea here is that instead of computing np.dot(X.T, X) (essentially the covariance, up to a constant factor), we work with C = np.dot(X, X.T), which is smaller. The basis vectors we need are obtained by multiplying the eigenvectors of C by X.T (if you are following Wikipedia's article, notice that their T has a different orientation from your X). However, these vectors are not normalized, unlike the vectors returned by np.linalg.eig. We have to normalize them before using them:

X = x - np.mean(x, axis=0)
C = np.dot(X, X.T)                      # small (samples x samples) matrix
evals, evecs = np.linalg.eig(C)
idx = np.argsort(evals)[::-1]           # sort eigenpairs by decreasing eigenvalue
evecs = evecs[:, idx]
evals = evals[idx]
v = np.dot(X.T, evecs)                  # map eigenvectors of C back to feature space
v /= np.linalg.norm(v, axis=0)          # normalize each basis vector
res3 = X.dot(v)                         # project the centered data onto the basis

This returns

[[-0.58652583 -5.82485371  5.05711518]
 [-8.96738194  3.18911605  1.72266002]
 [ 9.55390777  2.63573766 -6.7797752 ]]

which is correct in the first two columns. Again, the third column is noise, but now it's noise that went through normalization, so it's not small at all. One has to understand that the third column is meaningless.
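(If you want the code to drop such columns automatically, one option, sketched here rather than taken from the original answer, is to keep only the eigenvectors whose eigenvalues are not numerically zero; this continues from the snippet above, and the 1e-10 threshold is an arbitrary illustration.)

keep = evals > 1e-10 * evals.max()      # mask of non-negligible eigenvalues
v_clean = v[:, keep]                    # keep only the meaningful basis vectors
res_clean = X.dot(v_clean)              # projection without the noise column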

  • Thank you for the answer. I chose these data "randomly"; the goal is to have more features than samples. In fact, I am going to implement the eigenface process using numpy.linalg.eig for the PCA step. So I will have n images of size M (equal to width x height) with n < M. – Shiro Nov 17 '17 at 22:00
  • I may be wrong about the name "scatter matrix"; what I am trying to implement is a method often used in articles about eigenfaces. The goal is to not compute all M eigenvalues but only the n best eigenvalues/eigenvectors. See the 'Computing the eigenvectors' part on Wikipedia: https://en.wikipedia.org/wiki/Eigenface As the dimensions of my data are different from the wiki's, I tried to transpose them, but it seems it was a bad idea. – Shiro Nov 17 '17 at 22:01
  • I edited the answer with a correct "eigenface" approach. –  Nov 17 '17 at 22:38
  • Thank you for the help. I have one last question, regarding the last eigenvector (3rd column). You said that it is noise because it is essentially close to 0, but is that caused by the fact that its eigenvalue is very small compared to the others? So can we generalise that if an eigenvalue is very small compared to the previous eigenvalue, then the eigenvector is essentially composed of noise? – Shiro Nov 18 '17 at 15:53
  • In a way, yes: in PCA one can stop looking for new components once the eigenvalues drop dramatically. Still, one should consider the process in each particular situation to understand the meaning of the output. The 3rd eigenvector itself is meaningful: it's a vector perpendicular to the 2D plane containing our data. When it's multiplied by X, the results are theoretically 0; in practice they are round-off errors. Ideally they should then be understood as the zero vector. With the eigenface approach that vector is divided by its norm, and then it's not meaningful anymore. –  Nov 18 '17 at 16:19

My guess would be that your two methods of computing eigenvectors are giving different results from scipy.linalg.svd, which is what scikit-learn's PCA implementation uses (https://github.com/scikit-learn/scikit-learn/blob/f3320a6f/sklearn/decomposition/pca.py#L399).

This could be a good place to start: Eigenvectors computed with numpy's eigh and svd do not match
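As a quick illustration of the kind of mismatch that question is about (my own sketch reusing the x from the question, not part of the original answer): the eigenvectors of np.dot(X.T, X) and the right singular vectors of X span the same directions but can differ in sign, so the projected coordinates can come out with flipped signs depending on the method.

X = x - np.mean(x, axis=0)
evals, evecs = np.linalg.eigh(np.dot(X.T, X))    # eigenvalues in ascending order
evecs = evecs[:, ::-1]                           # reorder to descending eigenvalues
U, S, V = np.linalg.svd(X, full_matrices=False)  # rows of V are right singular vectors
# The leading directions agree only up to sign:
print(np.allclose(np.abs(evecs[:, :2]), np.abs(V[:2].T)))  # True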

Noah_S
  • Thanks for the answer; that's true, it uses SVD, but does that mean that the eigenvectors have to be different? I tried using SVD instead (available in the edit) but I have the same problem: some values are different. The results I get using the covariance PCA, the SVD and the PCA from sklearn are pretty close. But with the "eigenface" method it is totally different. Why? – Shiro Nov 17 '17 at 21:40