How to prepare feature vectors for text classification when the words in the text are not frequently repeated?

Question

I need to perform text classification on a set of emails. However, the words in my text are very sparse, i.e. the frequency of each word across all the documents is very low; words are rarely repeated. Because of this, I think a document-term matrix with term frequency as the weight is not suitable for training the classifiers. Can you please suggest what other methods I could use?

Thanks

score 0 · Accepted Answer · answered Mar 21 '16 at 07:14

The real problem is that if your words are that sparse, a learned classifier will not generalise to real-world data. However, there are several solutions:

1.) Use more data. This is kind of a no-brainer. However, you are not limited to adding labelled data; you can also use unlabelled data in a semi-supervised learning setting.

2.) Use more data (part b). You can look into the transfer learning setting. There you build a classifier on a large data set with similar characteristics (this might be Twitter streams) and then adapt this classifier to your domain.

3.) Get your processing pipeline right. Your problem might originate from a suboptimal processing pipeline. Are you doing stemming? In the emails, the word "stemming" should be mapped onto "stem". This can be pushed even further by using synonym matching with a dictionary.
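
To illustrate point 3, here is a minimal sketch of how stemming can shrink the vocabulary and raise the document frequency of the remaining terms. The `toy_stem` function, the `SUFFIXES` list and the three sample emails are made up for illustration; `toy_stem` is only a crude stand-in for a real stemmer (for example NLTK's `PorterStemmer`), not something to use in production.

```python
from collections import Counter

# Hypothetical suffix list for the toy stemmer; a real stemmer has many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip at most one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "stemm" becomes "stem".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def tokenize(text, stem=False):
    """Lowercase, split on whitespace, trim punctuation, optionally stem."""
    tokens = [t.strip(".,!?;:") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    return [toy_stem(t) for t in tokens] if stem else tokens


def document_frequency(docs, stem=False):
    """Count in how many documents each term occurs at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc, stem=stem)))
    return df


# Made-up sample emails, for illustration only.
emails = [
    "Stemming reduces sparse words",
    "The stemmer stemmed the word",
    "Sparse words hurt classifiers",
]

raw_df = document_frequency(emails, stem=False)
stemmed_df = document_frequency(emails, stem=True)

print("vocabulary size without stemming:", len(raw_df))
print("vocabulary size with stemming:   ", len(stemmed_df))
print("document frequency of 'word' after stemming:", stemmed_df["word"])
```

On this tiny sample, "words" and "word" collapse into one feature, as do "stemming" and "stemmed", so fewer features each occur in more documents. The same counts could then feed a document-term matrix; whether that is enough for your emails depends on how much of the sparsity comes from inflected forms rather than genuinely rare words.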
