74

The classifiers in machine learning packages like liblinear and nltk offer a method show_most_informative_features(), which is really helpful for debugging features:

viagra = None          ok : spam     =      4.5 : 1.0
hello = True           ok : spam     =      4.5 : 1.0
hello = None           spam : ok     =      3.3 : 1.0
viagra = True          spam : ok     =      3.3 : 1.0
casino = True          spam : ok     =      2.0 : 1.0
casino = None          ok : spam     =      1.5 : 1.0

My question is whether something similar is implemented for the classifiers in scikit-learn. I searched the documentation but couldn't find anything like it.

If there is no such function yet, does somebody know a workaround for how to get those values?

desertnaut
tobigue
  • You mean the most discriminating parameter? – Simon Bergot Jun 20 '12 at 09:44
  • I'm not sure what you mean by parameters. I mean the most discriminating features, like in a bag-of-words model for spam classification: which words give the most evidence for each class? Not the parameters, which I understand as "settings" for the classifier, like the learning rate etc. – tobigue Jun 20 '12 at 09:55
  • 11
    @eowl: in machine learning parlance, *parameters* are the settings generated by the learning procedure based on the *features* of your training set. Learning rate etc. are *hyperparameters*. – Fred Foo Jun 20 '12 at 14:44

9 Answers

68

The classifiers themselves do not record feature names, they just see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes) then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):

import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
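
For the binary case, a minimal untested sketch along those lines (print_top10_binary is just an illustrative name; it ranks by absolute coefficient, as discussed in the comments below):

import numpy as np

def print_top10_binary(vectorizer, clf, n=10):
    """Print the n features with the largest absolute coefficients (binary case)."""
    feature_names = np.asarray(vectorizer.get_feature_names())
    coef = np.ravel(clf.coef_)                 # coef_ is flattened for two classes
    top = np.argsort(np.abs(coef))[-n:][::-1]  # indices of the largest |coefficient|
    for j in top:
        # Positive coefficients indicate the second class in clf.classes_,
        # negative ones the first.
        print("%+.4f  %s" % (coef[j], feature_names[j]))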

Jorge Leitao
Fred Foo
  • Yeah, in my case I have only two classes, but with your code I was able to come up with the thing I wanted. Thanks a lot! – tobigue Jun 20 '12 at 19:35
  • @eowl: you're welcome. Did you take the `np.abs` of `coef_`? Because getting the highest-valued coefficients will only return the features that are indicative of the positive class. – Fred Foo Jun 20 '12 at 21:29
  • Something like that... I sorted the list and took the head and the tail, which allows you to still see which feature votes for which class. I posted my solution [below](http://stackoverflow.com/a/11140887/979377). – tobigue Jun 21 '12 at 14:58
  • 1
    For 2 classes, it looks like it is ``coef_`` rather than ``coef_[0]``. – Ryan R. Rosario Sep 12 '13 at 01:24
  • 2
    @RyanRosario: correct. In the binary case, `coef_` is flattened to save space. – Fred Foo Sep 12 '13 at 07:50
  • 4
    how are class_labels determined? I want to know the order of class labels. – Yandong Liu Feb 19 '14 at 04:55
  • @larsmans I tried to check out the example that you refer to, but it seems the link is broken: `An error has been encountered in accessing this page.` Could you update the link? Thanks! – john doe May 03 '15 at 19:58
  • 2
    You can get ordered classes from the classifier with `class_labels=clf.classes_` – wassname Sep 12 '15 at 09:09
  • @Fred Foo I have a stupid question: I always thought that [-x:] gives the last x elements and [x:] the first. Why do you write [-10:] here if we are looking for the 10 most useful features? Sorry for this question, but I really don't understand it. – Polly Sep 28 '16 at 11:34
  • Shouldn't you sort by the absolute value of the coefficients? – Philip Dec 17 '17 at 10:05
  • @Jorge thanks for posting this code. This only prints the top 10 features for the positive class, correct? So if I am doing sentiment analysis and want the top features for predicting the negative class, would I instead want to get the terms associated with the smallest coefficients? – Jane Sully Jun 28 '18 at 21:47
55

With the help of larsmans' code I came up with this code for the binary case:

def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print "\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2)
tobigue
  • How do you call the function from the main method? What do fn_1 and fn_2 stand for? I am trying to call the function from a decision tree classifier with scikit-learn. –  Mar 30 '14 at 20:37
  • 1
    This code will only work with a linear classifier that has a `coef_` array, so unfortunately I don't think it is possible to use it with sklearn's decision tree classifiers. `fn_1` and `fn_2` stand for the feature names. – tobigue Mar 31 '14 at 06:54
16

To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much each feature contributes to the model's predictions; in scikit-learn it is computed as the normalized total reduction of impurity brought by that feature (sometimes called Gini importance), so the values sum to 1.

I find this attribute very useful when performing feature engineering.

Thanks to the scikit-learn team and contributors for implementing this!

edit: This works for both RandomForest and GradientBoosting. So RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support this.
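
For example, a quick sketch of pairing those importances with column names (the dataset and variable names here are only illustrative):

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Any fitted forest works the same way; iris is just a stand-in dataset.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is aligned with the columns of X and sums to 1.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))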

ClimbsRocks
13

We've recently released a library (https://github.com/TeamHG-Memex/eli5) which allows you to do that: it handles various classifiers from scikit-learn, covers binary and multiclass cases, can highlight text according to feature values, integrates with IPython, etc.
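
A minimal sketch of how that might look (assuming a fitted TfidfVectorizer vec and a linear classifier clf; show_weights renders HTML in a notebook, while explain_weights plus format_as_text gives plain text):

import eli5

# In a Jupyter notebook: an HTML table of the top-weighted features per class.
eli5.show_weights(clf, vec=vec, top=20)

# In a plain script or console: the same explanation as text.
print(eli5.format_as_text(eli5.explain_weights(clf, vec=vec, top=20)))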

Mikhail Korobov
  • If someone needs a starting snippet: `from eli5 import show_weights; show_weights(model, vec=tfidf)` – Darius Aug 21 '20 at 07:08
6

I actually had to find the feature importances for my naive Bayes classifier, and although I used the functions above, I was not able to get the feature importances per class. I went through scikit-learn's documentation and tweaked the functions above a bit until they worked for my problem. Hope it helps you too!

def important_features(vectorizer, classifier, n=20):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()

    topn_class1 = sorted(zip(classifier.feature_count_[0], feature_names), reverse=True)[:n]
    topn_class2 = sorted(zip(classifier.feature_count_[1], feature_names), reverse=True)[:n]

    print("Important words in negative reviews")
    for coef, feat in topn_class1:
        print(class_labels[0], coef, feat)

    print("-----------------------------------------")
    print("Important words in positive reviews")
    for coef, feat in topn_class2:
        print(class_labels[1], coef, feat)

Note that your classifier (in my case it's naive Bayes) must have the feature_count_ attribute for this to work.
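
For example, a hypothetical call on toy data (MultinomialNB exposes feature_count_ after fitting):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = ["terrible plot and awful acting", "boring and terrible",
           "wonderful film, great acting", "great story"]
sentiments = ["neg", "neg", "pos", "pos"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
classifier = MultinomialNB().fit(X, sentiments)

important_features(vectorizer, classifier, n=5)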

Sai Sandeep
1

You can also do something like this to create a graph of the feature importances, ordered by importance:

import numpy as np
import matplotlib.pyplot as plt

# Assumes a fitted forest `clf` and a DataFrame `train` with the list of
# feature column names in `features`.
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(train[features].shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(train[features].shape[1]), indices)
plt.xlim([-1, train[features].shape[1]])
plt.show()
Oleole
0

RandomForestClassifier does not yet have a coef_ attribute, but it will in the 0.17 release, I think. However, see the RandomForestClassifierWithCoef class in Recursive feature elimination on Random Forest using scikit-learn. This may give you some ideas for working around the limitation above.
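
The idea from that linked question is roughly the following sketch (not the exact class from the link): subclass the forest and expose feature_importances_ under the coef_ name, so that utilities which expect coef_ (such as RFE) can use it:

from sklearn.ensemble import RandomForestClassifier

class RandomForestClassifierWithCoef(RandomForestClassifier):
    """Random forest that aliases feature_importances_ as coef_ after fitting."""

    def fit(self, *args, **kwargs):
        super().fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
        return self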

Daisuke Aramaki
0

Not exactly what you are looking for, but a quick way to get the largest-magnitude coefficients (assuming the columns of your pandas DataFrame are your feature names):

You trained the model like:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

lr = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(df, Y, test_size=0.25)
lr.fit(X_train, y_train)

Get the 10 largest negative coefficient values (or change to reverse=True for largest positive) like:

sorted(list(zip(df.columns, lr.coef_)), key=lambda x: x[1],
       reverse=False)[:10]
slevin886
0

First make a list; I call this list label. Then I extract all the feature names (the column names) and add them to the label list. Here I use a naive Bayes model; in a naive Bayes model, feature_log_prob_ gives the log probability of each feature per class.

import numpy as np

def top20(model, label):
    # model.feature_log_prob_ has shape (n_classes, n_features);
    # label is the list of feature names, aligned with those columns.
    feature_prob = model.feature_log_prob_
    for i in range(len(feature_prob)):
        print('top 20 features for class {}'.format(i))
        # Sort features by log probability, highest (most likely) first.
        top_indices = np.argsort(feature_prob[i])[::-1][:20]
        for idx in top_indices:
            print(label[idx])
        print('*' * 80)
Jim Quirk