How to use SelectFromModel in sklearn to find the positively informative features for a class

Question

I think I understand that until recently people used the attribute coef_ to extract the most informative features from linear models in Python's machine learning library sklearn. Now users are pointed to SelectFromModel instead. SelectFromModel lets you reduce the features based on a threshold, so something like the following code reduces the features down to those with an importance >= 0.5. My question now: is there any way to determine whether a feature is positively or negatively discriminating for a class?

I have my data in a pandas DataFrame called data; the first column holds the filenames of the text files and the second column holds the labels.

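import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# data: pandas DataFrame with the columns "filenames" and "labels"
count_vect = CountVectorizer(input="filename", analyzer="word")
X_train_counts = count_vect.fit_transform(data["filenames"])
print(X_train_counts.shape)
tf_transformer = TfidfTransformer(use_idf=True)
traindata = tf_transformer.fit_transform(X_train_counts)
print(traindata.shape)  # report size of the training data
clf = LogisticRegression()
model = SelectFromModel(clf, threshold=0.5)
X_transform = model.fit_transform(traindata, data["labels"])
print("reduced features: ", X_transform.shape)
# get the names of all features
# (get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0)
words = np.array(count_vect.get_feature_names_out())
# get the names of the important features using the boolean mask from model
print(words[model.get_support()])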

score 2 · Accepted Answer · answered Jun 22 '16 at 17:14


To my knowledge, you need to go back to the .coef_ attribute and check which coefficients are negative or positive. A negative coefficient decreases the odds of that class occurring (a negative relationship), while a positive coefficient increases the odds of that class occurring (a positive relationship).

However, the sign will only give you the direction, not the importance of a feature. You will still need SelectFromModel to extract that.

answered Jun 22 '16 at 17:14

kazAnova


To be clear, this is not a feature of `SelectFromModel`, whose job is to summarise overall feature importance from e.g. `LogisticRegression`'s `coef_` attribute, and to select on that basis. `SelectFromModel`'s job isn't to analyse a model for you, but to select features! – joeln Jun 23 '16 at 07:10
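As a rough illustration of that comment, the following sketch checks that, for a binary LogisticRegression and a numeric threshold, the selection mask matches a comparison of the absolute coefficients against the threshold. The synthetic dataset from make_classification is an assumption used only to make the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative assumption).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

selector = SelectFromModel(LogisticRegression(), threshold=0.5).fit(X, y)

# Importance of each feature: magnitude of its coefficient; the sign is discarded.
importance = np.abs(selector.estimator_.coef_[0])
manual_mask = importance >= selector.threshold_

print(np.array_equal(manual_mask, selector.get_support()))
```

The sign information is lost in this step, which is why the direction has to be read from coef_ directly.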
