
I'm trying to fit a Decision Tree model on the UCI Adult dataset. I built the following pipeline to do so:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

nominal_features = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'race', 'sex', 'native-country']

nominal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

numeric_features = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_transformer, numeric_features),
        ('nominal', nominal_transformer, nominal_features)
    ]) # remaining columns will be dropped by default

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', DecisionTreeClassifier(criterion='entropy', random_state=0))
])

I then fit my model by calling

clf.fit(X_train, y_train)

Then, when I try to get feature importances,

clf.named_steps['classifier'].feature_importances_

I get an array of shape (104,)

array([1.39312528e-01, 1.92086014e-01, 1.15276068e-01, 4.01797967e-02,
       7.08805229e-02, 3.99687904e-03, 6.68727677e-03, 0.00000000e+00,
       1.02021005e-02, 5.06637671e-03, 7.97826949e-03, 5.64939616e-03,
       0.00000000e+00, 9.09583016e-04, 1.84022196e-03, 9.29047900e-04,
       1.74001682e-04, 8.55362503e-05, 2.32440522e-03, 4.65023589e-04,
       4.13278579e-03, 3.68265995e-03, 1.78503960e-02, 8.33035943e-03,
       6.94454768e-03, 1.75988171e-02, 5.40933687e-04, 7.51299294e-03,
       6.07480929e-03, 2.28627732e-03, 1.32219786e-03, 1.92990938e-01,
       1.18517448e-03, 1.61377248e-03, 5.72167000e-04, 1.34920904e-03,
       5.41685180e-03, 0.00000000e+00, 9.16416279e-03, 1.05824472e-02,
       3.07744966e-03, 3.07152204e-03, 5.06657379e-03, 5.21819782e-03,
       0.00000000e+00, 7.49534136e-03, 2.83936918e-03, 8.62398812e-03,
       5.78720378e-03, 5.37536831e-03, 2.99744077e-03, 1.87247908e-03,
       4.87696805e-04, 1.58422357e-03, 2.20761597e-03, 5.57396015e-03,
       1.17619435e-03, 1.87465473e-03, 4.08710965e-03, 6.73508851e-04,
       6.02887867e-03, 2.38887308e-03, 4.52029746e-03, 7.28018074e-05,
       5.13158297e-04, 2.66768058e-04, 0.00000000e+00, 3.28378333e-04,
       0.00000000e+00, 8.55362503e-05, 0.00000000e+00, 7.89886262e-04,
       1.84475320e-04, 1.37879652e-03, 0.00000000e+00, 3.27800552e-04,
       1.95189232e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 9.00792536e-04, 0.00000000e+00, 2.20606426e-04,
       5.82787439e-04, 4.85000896e-04, 5.33409400e-04, 0.00000000e+00,
       8.75840665e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       4.65546160e-04, 3.37472507e-04, 2.50837357e-04, 2.52474592e-04,
       0.00000000e+00, 1.47818105e-04, 3.06829767e-04, 3.73651596e-04,
       1.58778645e-04, 4.40566013e-03, 8.55362503e-05, 2.51672361e-04])

which is not correct, as I only have 13 features. I know the reason for this is the one-hot encoding: each encoded category gets its own importance score.

How can I get the importances of the actual (original) features?

Vadim Kotov
chesslad

2 Answers


I am afraid you cannot get importances for your initial features here. Your decision tree does not know anything about them; the only features it sees, and can assign importances to, are the encoded ones.

You may want to try permutation importance instead, which has several advantages over the tree-based feature importance and is easily applicable to pipelines - see Permutation importance using a Pipeline in SciKit-Learn.
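For instance, `sklearn.inspection.permutation_importance` can be applied to the fitted pipeline directly: it permutes the raw input columns before they reach the encoder, so you get one score per original feature. A minimal sketch on a synthetic stand-in for the Adult data (the column names and contents here are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in for the Adult data: one numeric and one nominal column
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "age": rng.randint(18, 65, size=200),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], size=200),
})
y = (X["age"] > 40).astype(int)

preprocessor = ColumnTransformer(transformers=[
    ("numeric", StandardScaler(), ["age"]),
    ("nominal", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), ["workclass"]),
])
clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(criterion="entropy", random_state=0)),
]).fit(X, y)

# Permuting the *raw* columns and re-scoring the whole pipeline yields one
# importance per original feature, not per one-hot-encoded dummy
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for name, imp in zip(X.columns, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```

Since `y` here is a deterministic function of `age`, permuting `age` destroys the score while permuting `workclass` barely moves it. For an honest estimate you would of course score on a held-out test set rather than the training data.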

desertnaut
  • Yes, I tried permutation importances and it works. So, whenever we have to onehot encode some features, feature importances is basically impossible to interpret? – chesslad Oct 19 '21 at 15:40
  • @chesslad Yes; plus, classical feature importance measures have not been very useful in practice, since different importance measures can give inconsistent results - see https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27 . – desertnaut Oct 19 '21 at 17:02

Fundamentally, the importance of a data column can be obtained by summing the importances of all the features that are based on it. Identifying column-to-feature mappings could be a little difficult to do by hand, but you can always use automated tools for that.
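With a recent scikit-learn (1.1+, where `Pipeline` and `SimpleImputer` expose `get_feature_names_out`), this summing can also be done by hand, without external tools. A minimal sketch on a synthetic stand-in for the Adult data (column names and contents are illustrative):

```python
from collections import defaultdict

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in for the Adult data
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "age": rng.randint(18, 65, size=200),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], size=200),
})
y = (X["age"] > 40).astype(int)

numeric_features = ["age"]
nominal_features = ["workclass"]

preprocessor = ColumnTransformer(transformers=[
    ("numeric", StandardScaler(), numeric_features),
    ("nominal", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), nominal_features),
])
clf = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(criterion="entropy", random_state=0)),
]).fit(X, y)

# Expanded names look like 'numeric__age' or 'nominal__workclass_Private'
names = clf.named_steps["preprocessor"].get_feature_names_out()
importances = clf.named_steps["classifier"].feature_importances_

# Sum the importances of all one-hot dummies belonging to each original column
agg = defaultdict(float)
for name, imp in zip(names, importances):
    group, feat = name.split("__", 1)
    if group == "nominal":
        feat = next(c for c in nominal_features if feat.startswith(c + "_"))
    agg[feat] += imp

print(dict(agg))  # one importance per original column, summing to 1.0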

For example, the SkLearn2PMML package can translate Scikit-Learn pipelines to PMML representation, and perform various analyses and transformations while doing so. The calculation of aggregate feature importances is well supported.

from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Use the bare classifier here; the `clf` from the question is itself a
# Pipeline that already contains the preprocessor, so nesting it would
# apply the preprocessing twice
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)

pipeline = PMMLPipeline([
  ("preprocessor", preprocessor),
  ("classifier", classifier)
])
pipeline.fit(X, y)
# Re-map the dynamic attribute to a static picklable attribute
classifier.pmml_feature_importances_ = classifier.feature_importances_

sklearn2pmml(pipeline, "PipelineWithImportances.pmml.xml")
user1808924