16

I'm pretty sure this has been asked before, but I'm unable to find an answer.

Running logistic regression with sklearn in Python, I'm able to transform my dataset to its most important features using the transform method:

from sklearn import linear_model

classf = linear_model.LogisticRegression()
func = classf.fit(Xtrain, ytrain)
reduced_train = func.transform(Xtrain)

How can I tell which features were selected as most important? More generally, how can I calculate the p-value of each feature in the dataset?

Salvador Dali
mel

3 Answers

15

As suggested in the comments above, you can (and should) scale your data prior to the fit, which makes the coefficients comparable. Below is a little code to show how this would work; I follow this plotting format so the features are easy to compare.

import numpy as np    
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt

x1 = np.random.randn(100)
x2 = np.random.randn(100)
x3 = np.random.randn(100)

#Make a difference in how strongly y depends on each feature
y = (3 + x1 + 2*x2 + 5*x3 + 0.2*np.random.randn(100)) > 0

X = pd.DataFrame({'x1':x1,'x2':x2,'x3':x3})

#Scale your data
scaler = StandardScaler()
scaler.fit(X) 
X_scaled = pd.DataFrame(scaler.transform(X),columns = X.columns)

clf = LogisticRegression(random_state = 0)
clf.fit(X_scaled, y)

feature_importance = abs(clf.coef_[0])
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)
pos = np.arange(sorted_idx.shape[0]) + .5

featfig = plt.figure()
featax = featfig.add_subplot(1, 1, 1)
featax.barh(pos, feature_importance[sorted_idx], align='center')
featax.set_yticks(pos)
featax.set_yticklabels(np.array(X.columns)[sorted_idx], fontsize=8)
featax.set_xlabel('Relative Feature Importance')

plt.tight_layout()   
plt.show()
Keith
  • For brevity you can also use `scale` instead of `StandardScaler`: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html – istewart Jan 20 '18 at 19:40
  • You can also scale your data directly in the fit call: `clf = LogisticRegression().fit(X/np.std(X, 0), y)` – rafine Jul 07 '22 at 08:28
4

You can look at the coefficients in the `coef_` attribute of the fitted model to see which features are most important. (For `LogisticRegression`, all `transform` is doing is looking at which coefficients are highest in absolute value.)
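
For illustration, here is a small sketch (not from the original answer) that ranks features by the absolute value of their coefficients, reusing the `clf` and `X_scaled` names from the answer above:

import numpy as np

# Pair each column with the absolute value of its fitted coefficient and sort
# descending; this mirrors what transform does internally.
ranked = sorted(zip(X_scaled.columns, np.abs(clf.coef_[0])),
                key=lambda pair: pair[1], reverse=True)
for name, weight in ranked:
    print(name, round(weight, 3))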

Most scikit-learn models do not provide a way to calculate p-values. Broadly speaking, these models are designed to be used to actually predict outputs, not to be inspected to glean understanding about how the prediction is done. If you're interested in p-values you could take a look at statsmodels, although it is somewhat less mature than sklearn.
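
A minimal sketch of getting per-feature p-values with statsmodels (not part of the original answer; it assumes the `X_scaled` and `y` defined in the answer above):

import statsmodels.api as sm

# statsmodels expects an explicit intercept column
X_const = sm.add_constant(X_scaled)
logit_result = sm.Logit(y.astype(int), X_const).fit()

print(logit_result.summary())   # coefficients, standard errors, p-values
print(logit_result.pvalues)     # p-value for each feature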

BrenBarn
  • It is my understanding that the size of the coefficients in `coef_` is not a measure of feature importance. Could you elaborate on how I should look at the numbers? Thanks – mel Jun 17 '14 at 05:02
  • @mel: Looking at the source code, I can see that `LogisticRegression.transform` is indeed using `coef_` to evaluate the feature importance. It just considers coefficients with a higher absolute value to be more important. The relevant code is [here](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_selection/from_model.py). If you want some other definition of "importance" you'll need to explain what that is. – BrenBarn Jun 17 '14 at 05:21
  • Indeed, `np.abs(coef_)` is an awful attempt at quantifying feature importance - a concept which doesn't really make much sense anyway in a multivariate setting (i.e. the variables act jointly to make the prediction) unless your model does variable selection, e.g. through sparsity. If the model promotes sparsity, then you can discard the variables whose weights are zero, but that is technically all you can really do if you want to be rigorous. Some other models expose `feature_importance`, and depending on the model this is a more or less univariate measure of how well the feature explains the data. – eickenberg Jun 17 '14 at 23:03
  • Couldn't you standardize your data to make the coefficients comparable? – Santosh Jul 20 '15 at 14:52
4

`LogisticRegression.transform` takes a threshold value that determines which features to keep. Straight from the docstring:

threshold : string, float or None, optional (default=None)
    The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., "1.25*mean") may also be used. If None and if available, the object attribute `threshold` is used. Otherwise, "mean" is used by default.

There is no object attribute `threshold` on LR estimators, so only those features whose coefficients have a higher absolute value than the mean (after summing over the classes) are kept by default.
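
In newer scikit-learn releases this estimator-level transform was removed; the equivalent thresholding now lives in SelectFromModel. A minimal sketch, reusing the `X_scaled` and `y` from the first answer:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Keep only features whose |coefficient| is at least the mean |coefficient|
selector = SelectFromModel(LogisticRegression(), threshold="mean")
selector.fit(X_scaled, y)

print(selector.get_support())             # boolean mask of kept features
X_reduced = selector.transform(X_scaled)  # same idea as the old LR.transform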

W4R10CK
Fred Foo