
I am trying to get the most important features for my GaussianNB model. The code from How to get most informative features for scikit-learn classifiers? or How to get most informative features for scikit-learn classifier for different class? only works when I use MultinomialNB. How can I calculate or retrieve the most important features for each of my two classes (Fault = 1 or Fault = 0) otherwise? My code is (not applied to text data):

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

df = df.toPandas()

X = X_df.values
# GaussianNB expects a 1-D target array; reshape(-1, 1) triggers a warning
Y = df['FAULT'].values

gnb = GaussianNB()
y_pred = gnb.fit(X, Y).predict(X)

print(confusion_matrix(Y, y_pred))
print(accuracy_score(Y, y_pred))

Where X_df is a dataframe with binary columns for each of my features.

  • [This accepted answer](https://stackoverflow.com/questions/11116697/how-to-get-most-informative-features-for-scikit-learn-classifiers) discusses getting features for only the binary classification case – G. Anderson Nov 27 '18 at 19:19
  • That's the example I cited: it only works for Bernoulli or Multinomial but not Gaussian – LN_P Nov 28 '18 at 09:44
  • You can use the permutation feature importance: https://scikit-learn.org/stable/modules/permutation_importance.html which is model agnostic and will tell you which feature is important. – glemaitre Dec 21 '19 at 18:04
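As suggested in the last comment, permutation importance is model-agnostic and works for GaussianNB. A minimal sketch on synthetic data (the binary features and the fault rule below are made up for illustration; the variable names mirror the question):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 5))    # binary feature columns, as in the question
Y = (X[:, 0] | X[:, 2]).astype(int)     # synthetic fault: depends on features 0 and 2

gnb = GaussianNB().fit(X, Y)

# Shuffle each column in turn and measure the drop in score:
# a large drop means the model relied heavily on that feature.
result = permutation_importance(gnb, X, Y, n_repeats=30, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Unlike inspecting theta_ or sigma_, this gives a single per-feature score rather than a per-class ranking, but it directly answers "which features does the model actually use?".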

1 Answer


This is how I tried to understand the important features of Gaussian NB. A fitted scikit-learn GaussianNB model exposes the attributes theta_ and sigma_, which hold the mean and variance of each feature per class. For example, in a binary classification problem, model.theta_ is an array with two rows, one per class, containing the mean of each feature; model.sigma_ has the same shape and contains the variances. (Note: sigma_ was renamed var_ in scikit-learn 1.0.)

# theta_[0]: per-feature means for class 0; argsort puts the smallest first
neg = model.theta_[0].argsort()
print(np.take(count_vect.get_feature_names(), neg[:10]))

print('')

# sigma_[0]: per-feature variances for class 0 (var_ in scikit-learn >= 1.0)
neg = model.sigma_[0].argsort()
print(np.take(count_vect.get_feature_names(), neg[:10]))

In short: sort theta_ (or sigma_) for the class of interest and map the sorted indices back to feature names.
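Since the question's data is not text, the same idea can be applied without a vectorizer by taking feature names straight from X_df. A sketch on made-up data (the column names and fault rule are invented for illustration), ranking features by the gap between the two classes' means:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
X_df = pd.DataFrame(rng.randint(0, 2, size=(200, 4)),
                    columns=["overheat", "vibration", "noise", "leak"])
Y = (X_df["overheat"] | X_df["leak"]).astype(int).values  # synthetic fault label

model = GaussianNB().fit(X_df.values, Y)

# theta_[c] holds the per-feature means for class c; a large gap between the
# two classes' means marks a feature that helps separate Fault=1 from Fault=0.
gap = np.abs(model.theta_[1] - model.theta_[0])
for name, g in sorted(zip(X_df.columns, gap), key=lambda t: -t[1]):
    print(f"{name}: {g:.3f}")
```

Here the two features that actually drive the synthetic label come out with the largest mean gaps.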

Rajesh Somasundaram