I'm wondering how to extract feature importances, together with the feature names, from LogisticRegression, GradientBoostingClassifier, and XGBoost models when each classifier sits inside a scikit-learn pipeline with a preprocessing step. In short: how do I extract feature importances from a scikit-learn Pipeline?

From the brief research I've done, I am not sure whether this is possible with scikit-learn alone.

I also found a package called ELI5 (https://eli5.readthedocs.io/en/latest/overview.html) that is supposed to address this for scikit-learn, but I would like to compare its output against feature importances extracted directly from the models. A sketch of the ELI5 call I had in mind follows.
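
This is roughly what I was planning to run (a minimal sketch, assuming the fitted lr_clf pipeline and the numerical_features list from the code further down; I haven't verified the output):

import eli5

# Explain the fitted logistic regression inside the pipeline, mapping its
# coefficients back to the input column names.
explanation = eli5.explain_weights(lr_clf.named_steps['lr'],
                                   feature_names=numerical_features)
print(eli5.format_as_text(explanation))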

Please see the code below:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Define the regressand (y) and the regressors (X)
X = df3.drop(['Prediction_SAP_Burst', 'Unnamed: 0'], axis=1)
y = df3['Prediction_SAP_Burst']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

# All features are numeric; keep the original column order so importances
# can be mapped back to names later (wrapping the columns in set() would
# scramble that order).
numerical_features = X.columns.to_list()

numerical_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())])

# Link all the transformers together in a ColumnTransformer

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features)
            ])
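
# Note (assumption, not verified here): after fitting, calling
# preprocessor.get_feature_names_out() (scikit-learn >= 1.0) should return
# the transformed column names, e.g. 'num__<column>', which would be an
# alternative to tracking numerical_features by hand.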

from xgboost import XGBClassifier

# create a pipeline for each classifier.

lr_clf = Pipeline(steps=[('preprocessor', preprocessor),
                         ('lr', LogisticRegression(class_weight={0: 0.52, 1: 16.14}, random_state=1))])

xgb_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', XGBClassifier(scale_pos_weight=8, random_state=1))])

gbc_clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', GradientBoostingClassifier(random_state=1, learning_rate=0.2,
                                                                    max_features=0.5, n_estimators=250,
                                                                    subsample=0.8))])

lr_clf.fit(X_train, y_train)
xgb_clf.fit(X_train, y_train)
gbc_clf.fit(X_train, y_train)

# Compute the classification report and confusion matrix for each model

def results(name: str, model: BaseEstimator) -> None:
    preds = model.predict(X_test)

    model_cv = cross_validate(model, X_train, y_train, cv=StratifiedKFold(n_splits=5), n_jobs=-1, scoring='f1')
    print(f"Kfold precision score: {model_cv['test_score']}")
    print(f"Average score of Kfold: {model_cv['test_score'].mean():.3f} +/- {model_cv['test_score'].std():.3f}")

    print(name + " score: %.3f" % model.score(X_test, y_test))
    print(classification_report(y_test, preds))
    labels = ['Good', 'Bad']

    conf_matrix = confusion_matrix(y_test, preds, normalize='true')

    # 'normal' is not a valid matplotlib font family; just set the size
    plt.rc('font', size=14)
    plt.figure(figsize=(10, 6))
    sns.heatmap(conf_matrix, xticklabels=labels, yticklabels=labels, annot=True, cmap='Blues')
    plt.title("Confusion Matrix for " + name)
    plt.ylabel('True Class')
    plt.xlabel('Predicted Class')
    plt.show()

results("Logistic Regression" , lr_clf)
results("X Gradient Boost" , xgb_clf)
results("Gradient boost Classifier" , gbc_clf)
