
I am working on feature engineering for text classification and am stuck on choosing features. Most of the literature says to tokenize the text and use the tokens as features (after removing stop words and punctuation), but then you miss out on multi-word terms like "Lung cancer" or longer phrases. So the question is: how do I decide on the n-gram order and treat n-grams as features?

2 Answers


The relevant 2-gram (in this case "Lung cancer") will stand out by frequency.
Imagine the following text:

I know someone who has Lung cancer: Lung cancer is terrible disease.

2-gram vs Frequency

If you make a list of the 2-grams, you'll end up with "Lung cancer" first, and other combinations ("has Lung", "cancer is") after it.
This is because certain groups of words denote something specific, and are therefore repeated, while others are just connectors ("has" or "is") that form 2-grams circumstantially. The key is to filter by frequency.
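As a minimal sketch of the frequency-filtering idea, here is how you could count 2-grams over the example sentence with the Python standard library and keep only the ones that recur (the tokenization is deliberately crude; a real pipeline would use a proper tokenizer):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "I know someone who has Lung cancer: Lung cancer is terrible disease."
# crude tokenization: split on whitespace, strip punctuation, lowercase
tokens = [w.strip(".,:;!?").lower() for w in text.split()]

counts = Counter(ngrams(tokens, 2))
# keep only 2-grams that occur more than once
frequent = {ng: c for ng, c in counts.items() if c > 1}
print(frequent)  # {('lung', 'cancer'): 2}
```

On real corpora you would set the frequency threshold empirically rather than at "more than once".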

If you are having trouble generating n-grams, you might be using the wrong libraries or toolset.

mik

I would say that this depends heavily on your training data. You can visualise the frequency distributions of bigrams and trigrams; this might give you an idea of which n-gram orders are relevant. You might also want to look at noun chunks during your investigation: relevant noun chunks (or parts of them) should appear often, which can give you a sense of how to select your n-grams.
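One way to do this comparison is to count bigrams and trigrams per document and inspect the top of each distribution. The sketch below uses only the standard library and a hypothetical two-document toy corpus (for noun chunks you could instead iterate over spaCy's `doc.noun_chunks`):

```python
from collections import Counter

def ngram_counts(docs, n):
    """Count n-grams per document so n-grams never span document boundaries."""
    counts = Counter()
    for doc in docs:
        toks = doc.lower().split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts

# hypothetical toy corpus; substitute your training documents
docs = [
    "lung cancer screening reduces lung cancer mortality",
    "small cell lung cancer is a type of lung cancer",
]

bigram_counts = ngram_counts(docs, 2)
trigram_counts = ngram_counts(docs, 3)

print(bigram_counts.most_common(3))   # ('lung', 'cancer') dominates with count 4
print(trigram_counts.most_common(3))  # flat distribution: no trigram repeats here
```

If the bigram distribution has a clear heavy head while the trigram distribution is nearly flat, as in this toy example, bigrams are probably the more useful order for your features.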

Pascal Zoleko