I am practicing using LSA to classify the Enron dataset (all emails). My understanding is that, before any further classification or clustering can succeed, I need a lower-rank approximation from TruncatedSVD that retains as much of the variance as possible.
I have done all the pre-processing I could think of, including 1) removing all punctuation, 2) removing words shorter than 2 characters, 3) removing documents with less than 1500 bytes of text (tf-idf works better with longer text), and 4) removing stop words.
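For reference, here is a minimal sketch of what these four steps could look like with scikit-learn's TfidfVectorizer. It is an illustration under assumptions, not my exact code: the names MIN_BYTES, clean and build_tfidf are hypothetical, and I am assuming the built-in English stop word list and UTF-8 byte lengths.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

MIN_BYTES = 1500  # hypothetical name; documents shorter than this are dropped


def clean(text):
    # Replace every character that is not a word character or whitespace
    # with a space, then lowercase (step 1: punctuation removal).
    return re.sub(r"[^\w\s]", " ", text).lower()


def build_tfidf(docs):
    # Step 3: keep only documents with at least MIN_BYTES bytes of text.
    kept = [d for d in docs if len(d.encode("utf-8")) >= MIN_BYTES]
    vectorizer = TfidfVectorizer(
        preprocessor=clean,
        stop_words="english",            # step 4: assumed built-in stop list
        token_pattern=r"(?u)\b\w\w+\b",  # step 2: tokens of 2+ characters only
    )
    return vectorizer, vectorizer.fit_transform(kept)
```

Calling build_tfidf(emails) on a list of raw email strings returns the fitted vectorizer and the sparse tf-idf matrix that would then be passed to TruncatedSVD.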
However, if I set the number of components to 100, as scikit-learn suggests for LSA, I only capture 35% of the variance (svd.explained_variance_ratio_.sum()). With 2000 components I get 80%. (I read somewhere that retaining 90% of the variance is recommended?)
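To make the measurement concrete, this is roughly how the retained variance can be inspected for every component count at once, rather than refitting for each value. It is only a sketch: components_for_variance, max_components and target are hypothetical names, and the 0.90 default simply mirrors the figure I read about.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD


def components_for_variance(X, max_components, target=0.90):
    # Fit once with the largest number of components of interest.
    svd = TruncatedSVD(n_components=max_components, random_state=0)
    svd.fit(X)
    # cumulative[i] is the fraction of variance retained by the
    # first i + 1 components.
    cumulative = np.cumsum(svd.explained_variance_ratio_)
    # Smallest component count whose cumulative ratio reaches the target,
    # or None if max_components is not enough.
    reached = np.flatnonzero(cumulative >= target)
    k = int(reached[0]) + 1 if reached.size else None
    return k, cumulative
```

The last entry of cumulative equals svd.explained_variance_ratio_.sum() for max_components, so plotting the array against the component count shows how quickly the variance accumulates.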
So my questions are, in order to perform LSA successfully: 1) how do I test and pick the number of components? 2) is a high number of components normal? 3) is there anything I can do to increase the retained variance while keeping the number of components low?