
I am practicing using LSA to classify the Enron dataset (all emails). My understanding is that, to successfully perform any further classification or clustering, I need to perform a lower-rank approximation with TruncatedSVD that retains as much of the variance as possible.

I have done all the pre-processing I could think of, including: 1) removing all punctuation; 2) removing words with fewer than 2 characters; 3) removing documents whose text is smaller than 1500 bytes (tf-idf works better with longer text); 4) removing stop words.
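For reference, here is a minimal sketch of those preprocessing steps, assuming the raw emails are already loaded into a list of strings called `raw_emails` (that variable name and the `preprocess` helper are hypothetical, not from the original post):

    import re
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def preprocess(text):
        # 1) remove punctuation, keeping only word characters and whitespace
        text = re.sub(r"[^\w\s]", " ", text.lower())
        # 2) drop words with fewer than 2 characters and 4) drop stop words
        tokens = [t for t in text.split()
                  if len(t) >= 2 and t not in ENGLISH_STOP_WORDS]
        return " ".join(tokens)

    # 3) keep only documents whose raw text is at least 1500 bytes
    docs = [preprocess(e) for e in raw_emails
            if len(e.encode("utf-8")) >= 1500]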

However, if I set n_components to 100, as the scikit-learn documentation suggests for LSA, I can only capture about 35% of the variance (svd.explained_variance_ratio_.sum()). I tried n_components = 2000 and can get 80%. (I read somewhere that one should aim for around 90% of the variance?)
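This is roughly how I run the check; a minimal sketch, assuming `docs` is the list of preprocessed email strings from above (hypothetical name):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    X = TfidfVectorizer().fit_transform(docs)

    # Fit once with a generous component budget, then inspect the cumulative
    # explained variance to see how many components reach a given target.
    svd = TruncatedSVD(n_components=2000, random_state=42)
    svd.fit(X)

    cumulative = np.cumsum(svd.explained_variance_ratio_)
    print("variance at 100 components :", cumulative[99])
    print("variance at 2000 components:", cumulative[-1])

    # smallest number of components explaining at least 80% of the variance
    print("components for 80%:", np.searchsorted(cumulative, 0.80) + 1)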

So my questions are: to perform a successful LSA, 1) how do I test for and pick the number of components? 2) Is such a high component count normal? 3) Is there anything I can do to increase the explained variance while keeping the number of components low?

John Li
  • Forgot to add: I use TfidfVectorizer. I tried CountVectorizer, and it gave me 99% variance with n_components = 500. However, I understand tf-idf is a better approach than raw counts? – John Li Jul 13 '18 at 23:17
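A minimal sketch of that comparison, again assuming `docs` holds the preprocessed emails (hypothetical name). Note that explained_variance_ratio_ is measured against each matrix's own total variance, so the two vectorizers are not directly comparable on this number alone:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    for name, vec in [("count", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
        X = vec.fit_transform(docs)
        svd = TruncatedSVD(n_components=500, random_state=42).fit(X)
        print(name, svd.explained_variance_ratio_.sum())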

0 Answers