
At the beginning I had 400,000 images that were normalized (gray-value stretching). I then computed the DFT of each image, which gave 400,000 samples with 3200 absolute Fourier coefficients each.
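For context, a minimal sketch of the preprocessing described above, i.e. taking the 2D DFT of one image and keeping the coefficient magnitudes. The image shape is an assumption, chosen so that 64 * 50 = 3200 features come out:

import numpy as np

# hypothetical normalized grayscale image as a 2D float array
image = np.random.rand(64, 50)        # shape is illustrative only

coeffs = np.fft.fft2(image)           # 2D DFT
magnitudes = np.abs(coeffs).ravel()   # 3200 absolute Fourier coefficients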

Now I would like to do a PCA and an SVD. Since my data is already normalized and all values have the same units, I thought I could use the raw data for the calculation. However, the eigenvalues from the PCA and the singular values from the SVD are different (see the images in my answer below).

What am I doing wrong? In what form should the data be for PCA and SVD: normalized, standardized, or raw?

Please help me! Thank you

My Code:

import numpy as np
from scipy.linalg import svd
import matplotlib.pyplot as plt

# samples: 400000x3200 array of absolute Fourier coefficients

# SVD
U, S, VT = svd(samples, full_matrices=False)

# cumulative share of each singular value in the total
tot_S = sum(S)
var_exp_S = [(i / tot_S) for i in S]
cum_var_exp_S = np.cumsum(var_exp_S)

# PCA: np.cov centers the data and scales by 1/(N-1)
cov_mat = np.cov(samples.T)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)  # eigh: cov_mat is symmetric
eigen_vals = np.asarray(sorted(eigen_vals, reverse=True))

# cumulative share of each eigenvalue in the total
tot = sum(eigen_vals)
var_exp = [(i / tot) for i in eigen_vals]
cum_var_exp = np.cumsum(var_exp)


num = 3200
plt.figure(figsize=(10, 5))

plt.subplot(121)
plt.title('PCA')
plt.step(range(1, num + 1), cum_var_exp[:num], where='mid', color='r',
         label='cumulative eigenvalue share')
plt.ylabel('share of variance')
plt.xlabel('principal components')
plt.legend()
plt.grid()

plt.subplot(122)
plt.title('SVD')
plt.step(range(1, num + 1), cum_var_exp_S[:num], where='mid', color='r',
         label='cumulative singular value share')
plt.ylabel('share of variance')
plt.xlabel('principal components')
plt.legend()
plt.grid()
plt.show()
Len01
  • The normalisation for the PCA is done by cov(). Numerical problems could be an issue; I would suggest using some simple data from Wikipedia or similar to validate your variables. (I don't remember the exact dependency between SVD and PCA, since in my library I use power iteration instead of SVD.) – draz Apr 23 '20 at 12:27
  • First, thank you very much for your answer. I found this note for the svd function I'm using in my code: "the corresponding (possibly non-zero) eigenvalues are given by s**2." So according to your answer I don't need cov(), because the data is already normalized? Then I can calculate the eigenvalues directly from the raw data with np.linalg.eigh((samples.T).dot(samples))? – Len01 Apr 23 '20 at 12:50
  • No, I guess you need the SVD. I just do it without one because I don't want to implement the SVD myself. As for the normalisation, the question is what exactly you mean; there are some parts of PCA that do some normalisation. I put it into an answer: – draz Apr 23 '20 at 16:11

3 Answers


There are some "normalisations" within a PCA. Here is the code from my own PCA library:


//normalize to center
centred = center( samples );
//normalize to square matrix
matrix = cov( centred );
//eigenvalue decomposition
vectors = evd( matrix );
//get normalized eigenvectors:
eigenvectors = get_eigenvectors( vectors );
//get eigenvalues:
eigenvalues = get_eigenvalues( vectors );
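For comparison, a rough NumPy sketch of the same steps (samples is assumed to be an (N, d) array; names are illustrative, not the library's actual API):

import numpy as np

def pca(samples):
    # normalize to center: subtract the per-feature mean
    centred = samples - samples.mean(axis=0)
    # normalize to square matrix: d x d covariance
    matrix = np.cov(centred.T)
    # eigenvalue decomposition; eigh returns normalized eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    # sort by descending eigenvalue
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]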

The relation between SVD and PCA is described as: the eigenvalues of MᵀM are the squares of the singular values of M.
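A quick numerical check of that relation (a sketch with made-up data):

import numpy as np

M = np.random.rand(6, 4)

s = np.linalg.svd(M, compute_uv=False)   # singular values of M, descending
w = np.linalg.eigvalsh(M.T @ M)[::-1]    # eigenvalues of M^T M, descending

print(np.allclose(s**2, w))              # True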

draz
  • So I should use PCA/SVD with the raw data? – Len01 Apr 23 '20 at 18:22
  • That depends on what you want to do with it. PCA normally does an SVD to get the results. What is your plan for the singular values? (I have never used them before.) – draz Apr 23 '20 at 18:34
  • OK, I'll try to explain it with an example in a separate answer... – Len01 Apr 23 '20 at 18:48

Here is an example:

import numpy as np
from scipy.linalg import svd
from sklearn.preprocessing import StandardScaler

X_train = np.asarray([[13.71, 1.86, 2.36, 16.6],
                      [12.22, 1.29, 1.94, 19.0],
                      [13.27, 4.28, 2.26, 20.0],
                      [13.16, 3.57, 2.15, 21.0],
                      [13.86, 1.51, 2.67, 25.0]])

# PCA: standardize the data, then eigendecompose its covariance matrix
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)

cov_mat = np.cov(X_train_std.T)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)
eigen_vals = np.asarray(sorted(eigen_vals, reverse=True))

# SVD on the raw (non-standardized) data
U, eigen_vals_S, V = svd(X_train)
eigen_vals_S2 = eigen_vals_S**2

print('\nEigenvalues \n%s' % eigen_vals)
print('\nS \n%s' % eigen_vals_S)
print('\nS**2 \n%s' % eigen_vals_S2)

Output (Eigenvalues and S**2 are different):

Eigenvalues [2.79331043 1.28393579 0.90313734 0.01961644]

S [55.02775284 3.37434634 2.52347705 0.28664958]

S**2 [3.02805358e+03 1.13862132e+01 6.36793643e+00 8.21679822e-02]

Now with the same results:

# Same eigenvalues: eigendecompose the raw Gram matrix X^T X instead of cov()
eigen_vals, eigen_vecs = np.linalg.eigh((X_train.T).dot(X_train))
eigen_vals = np.asarray(sorted(eigen_vals, reverse=True))

#SVD
U,eigen_vals_S,V = svd(X_train)
eigen_vals_S2 = (eigen_vals_S**2)

print('\nEigenvalues \n%s' % eigen_vals)
print('\nS \n%s' %eigen_vals_S)
print('\nS**2 \n%s' %eigen_vals_S2)

Output:

Eigenvalues [3.02805358e+03 1.13862132e+01 6.36793643e+00 8.21679822e-02]

S [55.02775284 3.37434634 2.52347705 0.28664958]

S**2 [3.02805358e+03 1.13862132e+01 6.36793643e+00 8.21679822e-02]

And that's what I don't understand. The way with standardization and cov() is the usual method for PCA. But as you can see, the results differ depending on how I calculate the eigenvalues...

Am I wrong somewhere, or why is this so?

Len01
  • The SVD returns singular values, not eigenvalues. As described in my answer, they are related but not equal. – draz Apr 23 '20 at 19:01
  • I know, but given the description of svd, which says "In both cases the corresponding (possibly non-zero) eigenvalues are given by s**2." [link](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html), it should be the same when I square the singular values... or not? – Len01 Apr 23 '20 at 19:06
  • As far as I understood, it is: eigenvalue_i( MᵀM ) == singular_value_i( M )^2 – draz Apr 23 '20 at 19:07
  • If you just want to verify the eigenvalues, there are other ways. I always reconstruct the samples from the PCA scores, which reveals numerical problems. – draz Apr 23 '20 at 19:11
  • But the way I did it "manually" with standardization and cov() is PCA, isn't it? – Len01 Apr 23 '20 at 19:14
  • I would say so. For me, PCA is some normalisation of the data, building the matrix (using cov), and finally the tricky eigenvalue decomposition. The eigenvalue decomposition is normally implemented using SVD, as far as I know. From the eigenvalues and eigenvectors you calculate the PCA scores. – draz Apr 23 '20 at 19:20
  • In the first code snippet you got the eigenvalues from cov( train ); in the second one from ( train.T ).dot( train ) – that is why you get different results. – draz Apr 23 '20 at 19:28
  • Exactly, that is the point I don't understand. For PCA everyone says you have to calculate the covariance matrix, Cov(Train) = 1/(N-1) * (Train.T) * (Train) for centered Train, and then get the eigenvalues of this covariance matrix. But when I compute the eigenvalues "directly" from Train.T * Train, the results are the same as the squared singular values from the SVD. So the question is: why take the covariance matrix? – Len01 Apr 23 '20 at 19:45
  • Actually, you do the PCA to get the eigenVECTORS. – draz Apr 23 '20 at 19:52
  • That's true. Nevertheless I was wondering about this issue, especially if you look at the explanation from amoeba here [link](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca). – Len01 Apr 23 '20 at 19:52
  • (But the eigenvalues are interesting because they represent the weight of a vector.) – draz Apr 23 '20 at 19:55
  • So after the PCA you take the eigenvectors with the biggest eigenvalues, for example for feature extraction or compression. – draz Apr 23 '20 at 19:58
  • If your only interest is the eigenvalues, you don't need a full PCA, of course. – draz Apr 23 '20 at 20:00
  • Yeah, so my idea for choosing the number of eigenvectors is a cumulative distribution of the eigenvalues/singular values. Of course there is a difference between the two distributions, but how do I choose the right number of eigenvectors (see my answer with the images)? – Len01 Apr 23 '20 at 20:52
  • The right number depends completely on what you want to do. At the beginning, choose about 80% of the cumulative value. Then you take, say, 60% and 95% and compare performance and quality of the result. If you have a real-time application, say a snap filter for a video stream or Google page rank, performance is more important. If the application is not time-critical, you go for a higher cumulative value. – draz Apr 24 '20 at 06:56
  • You're welcome. I have to thank you for your help, and I think I found the problem with cov() and PCA here [link](https://stats.stackexchange.com/questions/301549/how-to-obtain-covariance-matrix-eigenvalues-from-singular-values?noredirect=1&lq=1). I centered my data and calculated the cov matrix from scratch, and with the answer from amoeba I now know that eigenvalues == (singular_values**2)/(N-1); a short sketch of this check follows below. That's it. So from now on I'll never just trust the functions; better to calculate it myself :) – Len01 Apr 24 '20 at 11:52
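A minimal sketch of that final relation (centered data, 1/(N-1) scaling), plus picking the number of components at a cumulative threshold as suggested above. The data shape and the 80% threshold are illustrative:

import numpy as np

X = np.random.rand(400, 32)              # illustrative data
N = X.shape[0]

Xc = X - X.mean(axis=0)                  # center the data

# covariance eigenvalues, descending
eig = np.linalg.eigvalsh(np.cov(Xc.T))[::-1]

# squared singular values of the centered data, scaled by 1/(N-1)
s = np.linalg.svd(Xc, compute_uv=False)
print(np.allclose(eig, s**2 / (N - 1)))  # True

# number of components covering ~80% of the variance
cum = np.cumsum(eig) / eig.sum()
k = int(np.searchsorted(cum, 0.80)) + 1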

These images are from the dataset (400000x3200):

[image: PCA – cumulative share of variance over principal components]

[image: SVD – cumulative share of variance over principal components]

Len01