
I am using scikit-learn for a binary text classification task, and I want to identify the best features for 'each' class, so that based on them my classifier can identify the correct class.

Any directions on how I can do that? Right now I am using CountVectorizer, and I can see the most common features across both classes, but it doesn't show me which feature belongs to which class. Also, the most common features probably aren't always the best: a feature can be common to both classes, which makes it poor at identifying a sample as class A or class B.

Here is what I am doing:

from sklearn.feature_extraction.text import CountVectorizer

# tokens2 is my custom tokenizer
vec = CountVectorizer(tokenizer=tokens2, max_features=2000)
x = vec.fit_transform(X_train).toarray()

print(x)
print(len(x[0]))  # number of features: 2000 in my case
print(len(x))     # number of samples: 980 in my case
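
For what it's worth, I can get per-class frequencies directly from the count matrix like this (a minimal sketch, assuming y_train is an array-like of 0/1 labels aligned with X_train; on scikit-learn versions before 1.0, get_feature_names() replaces get_feature_names_out()), but that still only gives the most common terms per class, not the most discriminative ones:

import numpy as np

y = np.asarray(y_train)  # assumed: binary 0/1 labels aligned with X_train
names = np.asarray(vec.get_feature_names_out())

counts_a = x[y == 0].sum(axis=0)  # total count of each term in class 0
counts_b = x[y == 1].sum(axis=0)  # total count of each term in class 1

print(names[np.argsort(counts_a)[-10:][::-1]])  # 10 most frequent terms in class 0
print(names[np.argsort(counts_b)[-10:][::-1]])  # 10 most frequent terms in class 1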

I understand that max_features limits the vocabulary to the top K features overall, whereas I want the top features for 'each' class. I've also looked at alvas's answer here, but his code seems to work only when the classifier is MultinomialNB. I used it successfully with that classifier; however, changing the classifier to DecisionTreeClassifier raises the following error:

AttributeError: 'DecisionTreeClassifier' object has no attribute 'coef_'
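
For reference, here is a minimal sketch of what the coef_-based idea in that answer reduces to with MultinomialNB (assuming the x, y_train, vec, and names from above); tree models expose feature_importances_ rather than coef_, which is why the attribute error occurs:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(x, y_train)

# Log-probability ratio between the two classes: strongly positive values
# mark terms that point to class 1, strongly negative values to class 0.
log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
order = np.argsort(log_ratio)

print("top class-0 indicators:", names[order[:10]])
print("top class-1 indicators:", names[order[-10:][::-1]])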

When changing the classifier to SVC with a linear kernel, it prints results that I couldn't understand:

0   (0, 22699)  2.2089966234e-05
  (0, 17115)    0.00011044983117
  (0, 17106)    2.2089966234e-05
  (0, 17096)    2.2089966234e-05
  (0, 17094)    2.2089966234e-05
  (0, 17079)    2.2089966234e-05
  (0, 17077)    2.2089966234e-05
  (0, 17064)    2.2089966234e-05
  (0, 17047)    2.2089966234e-05
  (0, 10872)    0.00011044983117
  (0, 10871)    8.83598649358e-05
.
.
1   (0, 22699)  2.2089966234e-05
  (0, 17115)    0.00011044983117
  (0, 17106)    2.2089966234e-05
  (0, 17096)    2.2089966234e-05
  (0, 17094)    2.2089966234e-05
  (0, 17079)    2.2089966234e-05
  (0, 17077)    2.2089966234e-05
  (0, 17064)    2.2089966234e-05
  (0, 17047)    2.2089966234e-05
  (0, 10872)    0.00011044983117
  (0, 10871)    8.83598649358e-05
  (0, 10870)    0.000198809696106
  (0, 10516)    0.00011044983117
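
From what I can tell, coef_ comes back as a sparse matrix here, which would explain the (row, column) value triples above. A sketch of how I tried ranking the weights instead (assuming the x, y_train, and names from above; LinearSVC keeps coef_ as a dense array):

import numpy as np
from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(x, y_train)

weights = np.asarray(svm.coef_).ravel()  # dense (n_features,) vector in the binary case
order = np.argsort(weights)
print("top class-0 indicators:", names[order[:10]])         # most negative weights
print("top class-1 indicators:", names[order[-10:][::-1]])  # most positive weights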
Just to mention also: the total number of features I have is 16,908, and it can grow depending on the vectorizer parameters, of course. Will this be a problem? And when should I consider applying random projection?

1 Answer


If you choose a DecisionTreeClassifier or a RandomForestClassifier, you can access the feature_importances_ attribute to see which features contribute most to the prediction.

But this is a global contribution, not a per-class one.
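
A minimal sketch, reusing the x, y_train, and names from the question (RandomForestClassifier works the same way):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree.fit(x, y_train)

# feature_importances_ aggregates the impurity reduction each feature achieves
# across all splits, so the ranking is global, not per class.
order = np.argsort(tree.feature_importances_)[::-1]
for i in order[:10]:
    print(names[i], tree.feature_importances_[i])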
