
I'm working on neural networks, and to reduce the dimensions of a term-document matrix (documents versus the terms they contain, holding tf-idf values), I need to apply PCA. Something like this:

             Term 1      Term 2      Term 3      Term 4    ...
Document 1
Document 2        tf-idf values of the terms per document
Document 3
...
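For concreteness, a minimal sketch of how such a tf-idf matrix could be built (scikit-learn's TfidfVectorizer and the toy documents below are only placeholders for illustration, not my actual data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus purely for illustration.
docs = [
    "neural networks reduce dimensions",
    "term document matrix with tfidf values",
    "pca finds the principal components of the data",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)           # shape: (n_documents, n_terms)

print(vectorizer.get_feature_names_out())    # the term columns
print(X.toarray())                           # tf-idf value of each term per document
```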

PCA works by computing the mean of the data, subtracting that mean, and then using the following formula to get the covariance matrix.

Let the matrix M be the term-document matrix of dimension NxN

The covariance matrix then becomes

(M x transpose(M)) / (N - 1)

We then calculate the eigenvalues and eigenvectors to feed as feature vectors into the neural network. What I'm not able to comprehend is the importance of the covariance matrix, and what dimensions it is finding the covariance of.

With simple two-dimensional data, X and Y, it can be understood; but what dimensions are being correlated here?
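Concretely, this is roughly the computation I have in mind (a toy sketch with random numbers standing in for real tf-idf values, assuming documents are rows and terms are columns as in the table above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))                   # 5 documents x 8 terms, standing in for tf-idf values

X_centered = X - X.mean(axis=0)          # subtract the mean of every term column

n_docs = X.shape[0]
C = (X_centered.T @ X_centered) / (n_docs - 1)   # covariance matrix: terms x terms

eigvals, eigvecs = np.linalg.eigh(C)     # eigh, since C is symmetric
order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 3                                    # keep the top-k principal components
X_reduced = X_centered @ eigvecs[:, :k]  # documents x k feature vectors to feed the network
```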

Thank you

Hooli
  • To my understanding the covariance matrix is there for the PCA to reduce the dimensions of the matrix. If two eigenvectors are highly correlated, i.e. linearly dependent, you can drop one of them. – toxicate20 Nov 09 '12 at 11:52
  • Yes absolutely, sorry, my bad! – Hooli Nov 12 '12 at 19:03

1 Answer


Latent semantic analysis describes this relationship pretty well. It also explains how one uses first the full doc-term matrix, then the reduced one, to map lists (vectors) of terms to near-match docs -- i.e. why reduce in the first place.
See also making-sense-of-PCA-eigenvectors-eigenvalues. (The many different answers there suggest that no single one is intuitive for everybody.)
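For what it's worth, a minimal sketch of that LSA route, using scikit-learn's TruncatedSVD on the tf-idf matrix (the corpus, the component count, and the query below are only placeholders):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus purely for illustration.
docs = [
    "neural networks reduce dimensions",
    "term document matrix with tfidf values",
    "pca finds the principal components of the data",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # full doc-term matrix (tf-idf)

svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)          # documents in the reduced (latent) space

query = vectorizer.transform(["tfidf term values"])
query_reduced = svd.transform(query)      # map a list of terms into the same space

print(cosine_similarity(query_reduced, X_reduced))   # similarity to each doc: near-match docs
```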

denis