
I am trying to build a predictive model based on text mining, and I am confused about how many features I should set up in my model. I have 1000 documents in my analysis (so the training corpus will contain around 700 of them). The number of terms in the corpus is around 20,000, so it exceeds the number of documents (p >> n). Does having so many features make any sense?

Should the number of features in the HashingTF method be higher than the total number of terms in the corpus? Or should I make it smaller (e.g. 512 features)?
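Roughly, my setup looks like this (a minimal sketch assuming PySpark's HashingTF; the DataFrame and column names are just placeholders):

```python
from pyspark.ml.feature import Tokenizer, HashingTF

# `docs` stands in for my DataFrame with a string column "text"
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(docs)

# numFeatures is the number of hash buckets: should it be above ~20,000
# (the vocabulary size), or much smaller, e.g. 512?
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=512)
features = hashing_tf.transform(words)
```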

I am a little bit confused.

Arthur G.

1 Answer


Assuming you are talking about using just unigrams as features, you are right that we want p < n. (Not citing sources here since you seem to know what this means.)

To achieve p < n, you could either:

  1. select features with count >= k; measure performance for various k and select the best k, or

  2. use all features but with L1 regularization (both options are sketched after this list).
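Here is a minimal sketch of both options, assuming PySpark since you mention HashingTF; the "words"/"label" columns and the train_df DataFrame are placeholders, and option 1 is shown using document frequency (CountVectorizer's minDF) as the count threshold:

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Option 1: keep only terms appearing in at least k documents (here k = 5);
# refit with several values of k and keep the one that validates best.
cv_k = CountVectorizer(inputCol="words", outputCol="features", minDF=5.0)
pruned_train = cv_k.fit(train_df).transform(train_df)

# Option 2: keep the full vocabulary and rely on L1 regularization instead;
# elasticNetParam=1.0 means pure L1 (lasso), regParam controls its strength.
cv_all = CountVectorizer(inputCol="words", outputCol="features")
full_train = cv_all.fit(train_df).transform(train_df)
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        regParam=0.1, elasticNetParam=1.0)
lr_model = lr.fit(full_train)
```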

If you use hashing as you mentioned, you should set the number of features even lower than 512, because:

  1. n = 700 with p = 512 is still too skewed.
  2. Typically, only a small number of words are actually important; it might even be fewer than 50 in your case. You could try number of hash buckets = {10, 20, 50, 100, 500, 1000} and pick the one that performs best, as in the sketch below.
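For example, a sketch of that search, again assuming PySpark, a binary label, and pre-tokenized "words" columns (the names and the train/test split are placeholders):

```python
from pyspark.ml.feature import HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="label")  # area under ROC by default
scores = {}
for n_buckets in [10, 20, 50, 100, 500, 1000]:
    # Hash the tokens into n_buckets features and fit a simple classifier on them
    htf = HashingTF(inputCol="words", outputCol="features", numFeatures=n_buckets)
    model = LogisticRegression(featuresCol="features", labelCol="label") \
        .fit(htf.transform(train_df))
    scores[n_buckets] = evaluator.evaluate(model.transform(htf.transform(test_df)))

best_buckets = max(scores, key=scores.get)  # the bucket count with the best validation AUC
```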

Good luck!

Aayush
  • I will try different numbers of features as you suggested. I also thought about bigrams. Does that change the problem much? Can I use a bigger number of features (like p = 2048)? – Arthur G. Jul 20 '17 at 11:25
  • Whether bigrams make a difference depends on the problem you are trying to solve. Normally, we want n >> p (curse of dimensionality). You could use a large p with L1 regularization. If your dataset is small, you could just try them all. – Aayush Jul 20 '17 at 19:16