
I trained my classifier using a pipeline:

import category_encoders as ce
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
# ColumnSelector as in e.g. mlxtend.feature_selection
from mlxtend.feature_selection import ColumnSelector

param_tuning = {
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__max_depth': [3, 5, 7, 10],
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__subsample': [0.5, 0.7],
    'classifier__n_estimators': [100, 200, 500],
}

cat_pipe = Pipeline(
    [
        ('selector', ColumnSelector(categorical_features)),
        ('encoder', ce.one_hot.OneHotEncoder())
    ]
)

num_pipe = Pipeline(
    [
        ('selector', ColumnSelector(numeric_features)),
        ('scaler', StandardScaler())
    ]
)

preprocessor = FeatureUnion(
    transformer_list=[

        ('cat', cat_pipe),
        ('num', num_pipe)
    ]
)

xgb_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier())
    ]
)

grid = GridSearchCV(xgb_pipe, param_tuning, cv=5, n_jobs=-1, scoring='accuracy')

xgb_model = grid.fit(X_train, y_train)

The training data contain categorical features, so the transformed training set has shape (x, 100). After that, I try to explain a model prediction on unseen data. Since I pass a single unseen example directly to the preprocessor, it is transformed into shape (1, 15) (because a single observation does not contain every categorical level).

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns=xgb['classifier'].get_booster().feature_names)

And I got:

ValueError: Shape of passed values is (1, 15), indices imply (1, 100).

This occurs because the model was trained on the whole preprocessed dataset with shape (x, 100), but I pass the explainer a single observation with shape (1, 15). How do I correctly pass a single unseen observation to the explainer?

desertnaut
Alex Nikitin
  • "*single observation does not have all examples all categorical data*" - the requirement that it *does* have the same features is a very fundamental one, and it cannot be bypassed lightheartedly. If the extra features come from the pre-processing parts of the pipeline, the same should be done for unseen data as well. And, not quite sure what `df` is here, but we *never* do `.fit_transform` on unseen data - we use only `.transform` with the pre-processor we have already fitted with the training data. – desertnaut Mar 20 '21 at 11:50
  • @desertnaut Thank you for your answer. `df` is a single unseen data example. If I pass it to the pipeline to transform, it returns shape (1, 15) – Alex Nikitin Mar 20 '21 at 12:26

1 Answer


We never use .fit_transform() on unseen data; the correct way is to use the .transform() method of the pre-processor already fitted on your training data (here xgb['preprocessor']). That way, we ensure that the (transformed) unseen data have the same features as our (transformed) training data, and so they are compatible with the model built from the latter.

So, you should replace .fit_transform(df) here:

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns=xgb['classifier'].get_booster().feature_names)

with .transform(df):

eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].transform(df), columns=xgb['classifier'].get_booster().feature_names)
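
As a minimal illustration of the difference (using sklearn's `OneHotEncoder` with toy data rather than the `category_encoders` encoder and dataset from the question, but the behaviour is the same in this respect): an encoder fitted on the training data always emits the full set of learned columns, even for a single row, whereas calling `.fit_transform()` on that row refits the encoder and collapses the output to only the levels present in it.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training data with 3 categorical levels (a stand-in for the real dataset)
train = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
new_row = pd.DataFrame({'color': ['green']})  # a single unseen observation

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)  # learns all 3 levels from the training data

print(enc.transform(new_row).shape)      # (1, 3) - full training width
print(enc.fit_transform(new_row).shape)  # (1, 1) - refit sees only 1 level
```

One caveat for your setup: `GridSearchCV` fits a clone of the pipeline, so the fitted pre-processor lives on the result of the search (e.g. `grid.best_estimator_['preprocessor']`), not necessarily on the original `xgb_pipe` object; make sure the object you call `.transform()` on is the one that was actually fitted.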
desertnaut