
I am using scikit-learn for a binary text classification task, and I want to identify the best features for 'each' class, so that based on them my classifier can identify the correct class.

Any directions on how I can do that? Right now I am using CountVectorizer, and I can see the most common features across both classes, but it doesn't show me which feature belongs to which class. Also, the most common features probably aren't always the best: a feature can be common to both classes, which makes it poor at identifying a sample as class A or class B.

Here is what I am doing:

from sklearn.feature_extraction.text import CountVectorizer

# tokens2 is my custom tokenizer
vec = CountVectorizer(tokenizer=tokens2, max_features=2000)
x = vec.fit_transform(X_train).toarray()

print(x)
print(len(x[0]))  # number of features: 2000 in my case
print(len(x))     # number of samples: 980 in my case
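
For what it's worth, I can get per-class frequencies directly from the count matrix like this (a minimal sketch, assuming y_train is an array-like of 0/1 labels aligned with X_train; on scikit-learn versions before 1.0, get_feature_names() replaces get_feature_names_out()), but that still only gives the most common terms per class, not the most discriminative ones:

import numpy as np

y = np.asarray(y_train)  # assumed: binary 0/1 labels aligned with X_train
names = np.asarray(vec.get_feature_names_out())

counts_a = x[y == 0].sum(axis=0)  # total count of each term in class 0
counts_b = x[y == 1].sum(axis=0)  # total count of each term in class 1

print(names[np.argsort(counts_a)[-10:][::-1]])  # 10 most frequent terms in class 0
print(names[np.argsort(counts_b)[-10:][::-1]])  # 10 most frequent terms in class 1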

I understand that max_features limits the vocabulary to the top K features overall, whereas I want the top features for 'each' class. I've also looked at alvas's answer here, but his code seems to work only when the classifier is MultinomialNB. I used it successfully with that classifier; however, changing the classifier to DecisionTreeClassifier raises the following error:

AttributeError: 'DecisionTreeClassifier' object has no attribute 'coef_'
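
For reference, here is a minimal sketch of what the coef_-based idea in that answer reduces to with MultinomialNB (assuming the x, y_train, vec, and names from above); tree models expose feature_importances_ rather than coef_, which is why the attribute error occurs:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(x, y_train)

# Log-probability ratio between the two classes: strongly positive values
# mark terms that point to class 1, strongly negative values to class 0.
log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
order = np.argsort(log_ratio)

print("top class-0 indicators:", names[order[:10]])
print("top class-1 indicators:", names[order[-10:][::-1]])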

When changing the classifier to SVC with a linear kernel, it prints results that I couldn't understand:

0   (0, 22699)  2.2089966234e-05
  (0, 17115)    0.00011044983117
  (0, 17106)    2.2089966234e-05
  (0, 17096)    2.2089966234e-05
  (0, 17094)    2.2089966234e-05
  (0, 17079)    2.2089966234e-05
  (0, 17077)    2.2089966234e-05
  (0, 17064)    2.2089966234e-05
  (0, 17047)    2.2089966234e-05
  (0, 10872)    0.00011044983117
  (0, 10871)    8.83598649358e-05
.
.
1   (0, 22699)  2.2089966234e-05
  (0, 17115)    0.00011044983117
  (0, 17106)    2.2089966234e-05
  (0, 17096)    2.2089966234e-05
  (0, 17094)    2.2089966234e-05
  (0, 17079)    2.2089966234e-05
  (0, 17077)    2.2089966234e-05
  (0, 17064)    2.2089966234e-05
  (0, 17047)    2.2089966234e-05
  (0, 10872)    0.00011044983117
  (0, 10871)    8.83598649358e-05
  (0, 10870)    0.000198809696106
  (0, 10516)    0.00011044983117
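
From what I can tell, coef_ comes back as a sparse matrix here, which would explain the (row, column) value triples above. A sketch of how I tried ranking the weights instead (assuming the x, y_train, and names from above; LinearSVC keeps coef_ as a dense array):

import numpy as np
from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(x, y_train)

weights = np.asarray(svm.coef_).ravel()  # dense (n_features,) vector in the binary case
order = np.argsort(weights)
print("top class-0 indicators:", names[order[:10]])         # most negative weights
print("top class-1 indicators:", names[order[-10:][::-1]])  # most positive weights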
Just to mention also: the total number of features I have is 16,908, and it can grow depending on the vectorizer parameters, of course. Will this be a problem? And when should I consider applying random projection?

1 Answer


If you choose a DecisionTreeClassifier or a RandomForestClassifier, you can access the feature_importances_ attribute to see which features contribute most to the prediction.

But this is a global contribution, not a per-class one.
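
A minimal sketch, reusing the x, y_train, and names from the question (RandomForestClassifier works the same way):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree.fit(x, y_train)

# feature_importances_ aggregates the impurity reduction each feature achieves
# across all splits, so the ranking is global, not per class.
order = np.argsort(tree.feature_importances_)[::-1]
for i in order[:10]:
    print(names[i], tree.feature_importances_[i])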
