
I am trying to build a predictive model based on text mining, and I am confused about how many features I should set up in my model. I have 1000 documents in my analysis (so the training corpus will contain around 700 of them). The number of terms in the corpus is around 20,000, so it exceeds the number of documents (p >> n). Does having so many features make any sense?

Should the number of features in the HashingTF method be higher than the total number of terms in the corpus? Or should I make it smaller (e.g. 512 features)?
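Roughly, my setup looks like this (a minimal sketch assuming PySpark's HashingTF; the DataFrame and column names are just placeholders):

```python
from pyspark.ml.feature import Tokenizer, HashingTF

# `docs` stands in for my DataFrame with a string column "text"
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(docs)

# numFeatures is the number of hash buckets: should it be above ~20,000
# (the vocabulary size), or much smaller, e.g. 512?
hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=512)
features = hashing_tf.transform(words)
```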

I am a little bit confused.

Arthur G.

1 Answer


Assuming you are talking about using just unigrams as features, you are right that we want p < n. (Not citing sources here since you seem to know what this means.)

To achieve p < n, you could either:

  1. select features with count >= k; measure performance for various k and select the best k, or

  2. use all features but with L1 regularization (both options are sketched after this list).
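Here is a minimal sketch of both options, assuming PySpark since you mention HashingTF; the "words"/"label" columns and the train_df DataFrame are placeholders, and option 1 is shown using document frequency (CountVectorizer's minDF) as the count threshold:

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Option 1: keep only terms appearing in at least k documents (here k = 5);
# refit with several values of k and keep the one that validates best.
cv_k = CountVectorizer(inputCol="words", outputCol="features", minDF=5.0)
pruned_train = cv_k.fit(train_df).transform(train_df)

# Option 2: keep the full vocabulary and rely on L1 regularization instead;
# elasticNetParam=1.0 means pure L1 (lasso), regParam controls its strength.
cv_all = CountVectorizer(inputCol="words", outputCol="features")
full_train = cv_all.fit(train_df).transform(train_df)
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        regParam=0.1, elasticNetParam=1.0)
lr_model = lr.fit(full_train)
```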

If you use hashing as you mentioned, you should set the number of features even lower than 512, because:

  1. n = 700 with p = 512 is still too skewed.
  2. Typically, only a small number of words are actually important; it might even be fewer than 50 in your case. You could try number of hash buckets = {10, 20, 50, 100, 500, 1000} and pick the one that performs best, as in the sketch below.
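For example, a sketch of that search, again assuming PySpark, a binary label, and pre-tokenized "words" columns (the names and the train/test split are placeholders):

```python
from pyspark.ml.feature import HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(labelCol="label")  # area under ROC by default
scores = {}
for n_buckets in [10, 20, 50, 100, 500, 1000]:
    # Hash the tokens into n_buckets features and fit a simple classifier on them
    htf = HashingTF(inputCol="words", outputCol="features", numFeatures=n_buckets)
    model = LogisticRegression(featuresCol="features", labelCol="label") \
        .fit(htf.transform(train_df))
    scores[n_buckets] = evaluator.evaluate(model.transform(htf.transform(test_df)))

best_buckets = max(scores, key=scores.get)  # the bucket count with the best validation AUC
```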

Good luck!

Aayush
  • I will try different numbers of features as you suggested. I also thought about bigrams. Does that change the problem much? Can I use a bigger number of features (like p = 2048)? – Arthur G. Jul 20 '17 at 11:25
  • Whether bigrams make a difference depends on the problem you are trying to solve. Normally, we want n >> p (curse of dimensionality). You could use a large p with L1 regularization. If your dataset is small, you could just try them all. – Aayush Jul 20 '17 at 19:16