Is it possible to use tfidf (tfidfvectorizer in Python) to figure out which words are most important when trying to distinguish between two text classes (i.e., positive or negative sentiment, etc.)? For example, which words were most important for identifying the positive class, and then separately, which were most useful for identifying the negative class?
Are you familiar with PCA (Principal Component Analysis)? That's the idea you need, which will pull you out of the typical BoW or sentence-vector paradigm, but should give you good results. – Prune Jan 18 '17 at 21:05
1 Answer
You can let scikit-learn do the heavy lifting - train a random forest on your binary classification task, extract the classifier's feature importance ranking, and use it to get the most important words:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(data, labels)  # data: the tf-idf matrix, labels: the class labels
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first
feature_names = vectorizer.get_feature_names()  # get_feature_names_out() in newer scikit-learn
top_words = [feature_names[indices[i]] for i in range(100)]
Note that this only tells you which words are the most important overall - not what they say about each class. To find out what a word says about a class, you can classify the individual words themselves (as one-word documents) and see which class they are assigned to.
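That classify-the-word idea can be sketched like this - a minimal, self-contained illustration with an invented toy corpus, passing each word through the fitted vectorizer as a one-word document and reading off the class probabilities:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration
texts = ["great movie loved it", "wonderful great acting",
         "terrible movie hated it", "awful terrible acting"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(vectorizer.fit_transform(texts), labels)

# Classify each word as a one-word "document": a high P(positive)
# suggests the word signals the positive class, and vice versa.
for word in ["great", "terrible"]:
    proba = clf.predict_proba(vectorizer.transform([word]))[0]
    print(word, dict(zip(clf.classes_, proba)))
```

Here "great" should receive a noticeably higher positive-class probability than "terrible", since it only ever appears in positive training documents.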
Another option is to take all positive/negative data samples, remove the word you are trying to understand from them, and see how this affects the classification of those samples.
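That word-removal (ablation) idea can be sketched as follows - again a self-contained toy example with an invented corpus, where we drop one word from every positive sample and compare the average positive-class probability before and after:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration
texts = ["great movie loved it", "wonderful great plot",
         "great fun loved it", "terrible movie hated it",
         "awful terrible plot", "terrible bore hated it"]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(vectorizer.fit_transform(texts), labels)

def mean_positive_proba(docs):
    """Average P(positive) the classifier assigns to the documents."""
    proba = clf.predict_proba(vectorizer.transform(docs))
    pos_col = list(clf.classes_).index(1)
    return proba[:, pos_col].mean()

# Ablate one word from every positive sample and compare
word = "great"
positive_docs = [t for t, y in zip(texts, labels) if y == 1]
ablated = [" ".join(w for w in t.split() if w != word) for t in positive_docs]

before = mean_positive_proba(positive_docs)
after = mean_positive_proba(ablated)
print(f"mean P(positive) with '{word}': {before:.2f}, without: {after:.2f}")
```

A drop in the average probability after removing the word suggests the classifier was leaning on it to identify the positive class.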

ginge