I am working on feature engineering for text classification, and I am stuck on choosing features. Most of the literature says to tokenize the text and use the tokens as features (after removing stop words and punctuation), but then you miss multi-word terms like "lung cancer" and other phrases. So the question is: how do I decide the n-gram order and treat n-grams as features?
2 Answers
The relevant 2-gram (in this case "Lung cancer") will stand out by frequency.
Imagine the following text:
I know someone who has Lung cancer: Lung cancer is a terrible disease.
If you make a list of the 2-grams, you'll end up with "Lung cancer" first, and the other combinations ("has Lung", "cancer is") second.
This is because certain groups of words denote something, and are therefore repeated, while others are just connectors ("has" or "is") that form 2-grams circumstantially. The key is to filter by frequency.
If you are having trouble generating the n-grams themselves, you may be using the wrong libraries or toolset.
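As a minimal sketch of that frequency filter, using only Python's standard library (the threshold of 2 is arbitrary; on a real corpus you would tune it):

```python
from collections import Counter
import re

text = "I know someone who has Lung cancer: Lung cancer is a terrible disease."

# Lowercase and keep only alphabetic runs, so "cancer:" and "cancer"
# count as the same token.
tokens = re.findall(r"[a-z]+", text.lower())

# Build 2-grams as adjacent token pairs, then count them.
bigrams = Counter(zip(tokens, tokens[1:]))

# Keep only 2-grams at or above a frequency threshold: meaningful
# collocations like ("lung", "cancer") survive, circumstantial pairs drop out.
threshold = 2
for bigram, count in bigrams.most_common():
    if count >= threshold:
        print(" ".join(bigram), count)
```

On the example sentence this prints only `lung cancer 2`; every other 2-gram occurs once and is filtered away.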

I would say this depends heavily on your training data. You can visualise the frequency distributions of bigrams and trigrams, which might give you an idea of which n-gram orders are relevant. You might also want to use noun chunks during your investigation: relevant noun chunks (or parts of them) should appear often, which can give you a sense of how to select your n-grams.
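As a sketch of that investigation, assuming spaCy and its small English model are installed (`texts` is a stand-in for your training documents):

```python
from collections import Counter

import spacy  # assumes: pip install spacy && python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

texts = [
    "I know someone who has lung cancer.",
    "Lung cancer is a terrible disease.",
    # ... replace with your actual training documents
]

bigram_counts = Counter()
trigram_counts = Counter()
chunk_counts = Counter()

for doc in nlp.pipe(texts):
    tokens = [t.lower_ for t in doc if t.is_alpha]
    bigram_counts.update(zip(tokens, tokens[1:]))
    trigram_counts.update(zip(tokens, tokens[1:], tokens[2:]))
    # Noun chunks surface candidate multi-word terms like "lung cancer".
    chunk_counts.update(chunk.text.lower() for chunk in doc.noun_chunks)

# Inspect the head of each distribution to judge which orders carry signal.
print(bigram_counts.most_common(20))
print(trigram_counts.most_common(20))
print(chunk_counts.most_common(20))
```

Comparing the heads of these three distributions shows whether meaningful multi-word terms concentrate at order 2, order 3, or mostly inside noun chunks.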
