
I want to classify text using sklearn. First I used a bag-of-words representation to train on the data, but the bag-of-words feature space is very large (more than 10,000 features), so I reduced it to 100 dimensions with SVD.
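
For reference, here is roughly what my current pipeline looks like (texts is a placeholder for my list of raw documents):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(texts)   # sparse matrix, >10,000 columns

svd = TruncatedSVD(n_components=100)      # reduce to 100 dimensions
X_svd = svd.fit_transform(X_bow)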

Now I want to add some other features, like the number of words, number of positive words, number of pronouns, etc. There are fewer than 10 of these additional features, which is very small compared to the 100 bag-of-words features.
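
This is roughly how I compute the extra features and append them to the SVD features (count_positive_words and count_pronouns are placeholders for my own helpers):

import numpy as np

extra = np.array([[len(doc.split()),          # number of words
                   count_positive_words(doc), # placeholder helper
                   count_pronouns(doc)]       # placeholder helper
                  for doc in texts])

X = np.hstack([X_svd, extra])  # 100 SVD columns + the extra columns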

This raises two questions:

  1. Is there a function in sklearn that can change the additional features' weights to make them more important?
  2. How do I check whether the additional features are actually important to the classifier?
HAO CHEN
  • Sounds like you can simply append your additional features to your SVD features along the 1st axis, then train a classifier on the resulting matrix. There are a number of classifiers which allow you to see the feature importances, e.g. GradientBoostingClassifier. I don't think you can change the features' importances after training the classifier; their importances will reflect their usefulness in predicting your y. – Ryan Nov 28 '15 at 16:38
  • Thanks. I mean, are there functions that test the similarity between a feature and the class? That is, before training the classifier, could I get a similarity ranking that tells me which features are important for classification? – HAO CHEN Nov 28 '15 at 17:41
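
One minimal sketch of that kind of pre-training ranking, assuming the combined matrix X, labels y, and column names X_cols from above, uses sklearn's univariate ANOVA scoring:

from sklearn.feature_selection import f_classif

# F-statistic between each feature column and the class labels;
# higher values suggest the feature separates the classes better
F_scores, p_values = f_classif(X, y)
for name, score in sorted(zip(X_cols, F_scores), key=lambda t: -t[1]):
    print(name, score)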

1 Answer


Although I'm very much interested, I don't know the answer to the first question. In the meantime, I can help with the second one.

After fitting a model, you can access the feature importances through the attribute model.feature_importances_

I use the following function to normalize the importances and display them in a prettier way.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns  # optional, only for nicer default plot styling

def showFeatureImportance(model, X_cols):
    # X_cols: the feature/column names, in the same order as the
    # columns of the matrix the model was trained on

    # Get the feature importances from the fitted classifier
    feature_importance = model.feature_importances_

    # Normalize the importances to a 0-100 scale
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    pos = np.arange(sorted_idx.shape[0]) + .5

    # Plot the relative feature importance as a horizontal bar chart
    plt.figure(figsize=(12, 12))
    plt.barh(pos, feature_importance[sorted_idx], align='center', color='#7A68A6')
    plt.yticks(pos, np.asanyarray(X_cols)[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Feature Importance')
    plt.show()
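
For example, with the combined feature matrix X, labels y, and column names X_cols from the question (hypothetical names), the function could be used like this:

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X, y)
showFeatureImportance(model, X_cols)
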
fernandosjp