I am working on feature engineering for text classification, and I am stuck on choosing features. Most of the literature says to tokenize the text and use the tokens as features (after removing stop words and punctuation), but then you miss multi-word terms like "lung cancer" and other phrases. So the question is: how do I decide the n-gram order and treat n-grams as features?
2 Answers
The relevant 2-gram (in this case "Lung cancer") will stand out by frequency.
Imagine the following text:
I know someone who has Lung cancer: Lung cancer is a terrible disease.
If you make a list of the 2-grams, you'll end up with "Lung cancer" first, and the other combinations ("has Lung", "cancer is") second.
This is because certain groups of words denote something, and are therefore repeated, while others are just connectors ("has" or "is") that form 2-grams circumstantially. The key is to filter by frequency.
If you are having trouble generating the n-grams themselves, you may be using the wrong libraries or toolset.
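As a minimal sketch of that frequency filter, using only Python's standard library (the threshold of 2 is arbitrary; on a real corpus you would tune it):

```python
from collections import Counter
import re

text = "I know someone who has Lung cancer: Lung cancer is a terrible disease."

# Lowercase and keep only alphabetic runs, so "cancer:" and "cancer"
# count as the same token.
tokens = re.findall(r"[a-z]+", text.lower())

# Build 2-grams as adjacent token pairs, then count them.
bigrams = Counter(zip(tokens, tokens[1:]))

# Keep only 2-grams at or above a frequency threshold: meaningful
# collocations like ("lung", "cancer") survive, circumstantial pairs drop out.
threshold = 2
for bigram, count in bigrams.most_common():
    if count >= threshold:
        print(" ".join(bigram), count)
```

On the example sentence this prints only `lung cancer 2`; every other 2-gram occurs once and is filtered away.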

I would say this depends heavily on your training data. You can visualise the frequency distributions of bigrams and trigrams, which might give you an idea of which n-gram orders are relevant. You might also want to use noun chunks during your investigation: relevant noun chunks (or parts of them) should appear often, which can give you a sense of how to select your n-grams.
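As a sketch of that investigation, assuming spaCy and its small English model are installed (`texts` is a stand-in for your training documents):

```python
from collections import Counter

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

texts = [
    "I know someone who has lung cancer.",
    "Lung cancer is a terrible disease.",
    # ... replace with your actual training documents
]

bigram_counts = Counter()
trigram_counts = Counter()
chunk_counts = Counter()

for doc in nlp.pipe(texts):
    tokens = [t.lower_ for t in doc if t.is_alpha]
    bigram_counts.update(zip(tokens, tokens[1:]))
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))
    # Noun chunks surface candidate multi-word terms like "lung cancer".
    chunk_counts.update(chunk.text.lower() for chunk in doc.noun_chunks)

# Inspect the head of each distribution to judge which orders carry signal.
print(bigram_counts.most_common(20))
print(trigram_counts.most_common(20))
print(chunk_counts.most_common(20))
```

Comparing the heads of these three distributions shows whether meaningful multi-word terms concentrate at order 2, order 3, or mostly inside noun chunks.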
