Feature mismatch: Prediction through scikit-learn Pipeline

Question

I implemented the following scikit-learn pipeline in a file called build.py and later pickled it.

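# Imports are assumed from the arguments used: the encoders look like category_encoders classes.
import xgboost as xgb
from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# COL_TO_TARGET, COL_TO_DUM, mapping, X_train and y_train are defined elsewhere in build.py.
preprocessor = ColumnTransformer(transformers=[
        ('target', TargetEncoder(), COL_TO_TARGET),
        ('one_hot', OneHotEncoder(drop_invariant=False, handle_missing='value',
              handle_unknown='value', return_df=True, use_cat_names=True,
              verbose=0), COL_TO_DUM),
        ('construction', OrdinalEncoder(mapping=mapping), ['ConstructionPeriod'])
      ], remainder='passthrough')  # all other columns are passed through unchanged

test_pipeline = Pipeline(steps=[
            ('preprocessor', preprocessor),
            ('std_scale', StandardScaler()),
            ('XGB_model',
                xgb.XGBRegressor(
                    booster='gbtree', colsample_bylevel=0.75, colsample_bytree=0.75,
                    max_depth=20, grow_policy='depthwise', learning_rate=0.1
                 )
             )
        ])
test_pipeline.fit(X_train, y_train)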

import pickle

# pickle.dump takes the object first, then the open file
with open('final_pipeline.pkl', 'wb') as f:
    pickle.dump(test_pipeline, f)

The pickled pipeline is then loaded in a different file, app.py, which accepts user data and makes predictions with the unpickled pipeline.

import pickle
import pandas as pd

# load the fitted pipeline saved by build.py
with open('final_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)

# data comes from the user via the frontend
input_df = pd.DataFrame(data.dict(), index=[0])

# use the pipeline to predict
prediction = pipeline.predict(input_df)

The problem is that the unpickled pipeline expects the incoming data to have the same column structure as the data used to train it (X_train).

To solve this, I need to reorder the columns of the incoming data to match those of X_train.

A dirty solution would be to export the X_train column names to a file and read that file in app.py to rearrange the columns of the incoming data.

Any suggestions for a more Pythonic way to solve this?

score 0 · Answer 1 · answered Jun 07 '21 at 17:24

Your column order shouldn't matter, but if it does, why not sort the columns before fitting the pipeline and sort them the same way in your other file before predicting? That way you won't have to store anything locally.

df = df.reindex(sorted(df.columns), axis=1)
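To make the idea concrete, here is a minimal sketch of applying the same sort on both sides. The helper name sort_columns and the toy columns (rooms, area, city) are made up for illustration and are not from the question. It only helps when both frames contain exactly the same set of columns; a missing or extra column will still cause a mismatch.

```python
import pandas as pd


def sort_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with its columns in alphabetical order (hypothetical helper)."""
    return df.reindex(sorted(df.columns), axis=1)


# Toy stand-ins for the training data and the user input (assumed column names).
X_train = pd.DataFrame({"rooms": [3, 4], "area": [70.0, 95.0], "city": ["A", "B"]})
input_df = pd.DataFrame({"city": ["A"], "area": [80.0], "rooms": [3]}, index=[0])

# In build.py: sort before calling fit on the pipeline.
X_train_sorted = sort_columns(X_train)

# In app.py: sort the same way before calling predict.
input_sorted = sort_columns(input_df)

# Both frames now share one column order, whatever order they arrived in.
assert list(X_train_sorted.columns) == list(input_sorted.columns)
```

In build.py you would fit on X_train_sorted, and in app.py you would predict on input_sorted.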

answered Jun 07 '21 at 17:24

secretive

Thanks for replying. I looked into it and found many articles describing a similar problem. Feature order matters because the pipeline converts the features to a matrix, so the column labels are lost. Your idea worked for me. – eager_learner Jun 09 '21 at 10:03
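The point about labels being lost can be seen in a small sketch, along with a tidier variant of the "export the column names" idea from the question: store the training column list in the same pickle as the pipeline instead of a separate file. The dict keys "pipeline" and "columns" and the toy columns are assumptions for illustration, and None stands in for the fitted pipeline so the example runs on its own.

```python
import pickle

import pandas as pd

# Toy training frame (assumed column names).
X_train = pd.DataFrame({"rooms": [3, 4], "area": [70.0, 95.0]})

# build.py side: bundle the training column order with the model.
# None is a placeholder for the fitted pipeline object.
bundle = {"pipeline": None, "columns": list(X_train.columns)}
blob = pickle.dumps(bundle)  # in practice, pickle.dump(bundle, f) to a file

# app.py side: load the bundle and align the incoming frame.
loaded = pickle.loads(blob)
input_df = pd.DataFrame({"area": [80.0], "rooms": [3]}, index=[0])

# As a bare matrix, the values sit in the wrong positions: area comes first.
unaligned = input_df.to_numpy().tolist()

# Selecting by the stored names restores the training order.
# A missing column raises KeyError here instead of failing silently later.
aligned_df = input_df[loaded["columns"]]
aligned = aligned_df.to_numpy().tolist()
```

Here unaligned holds the area value first, while aligned puts rooms first as in X_train. Recent scikit-learn versions (1.0 and later) also record the column names seen during fit as feature_names_in_ on estimators fitted on a DataFrame, which may make the separate list unnecessary; check that the attribute is present on your pipeline before relying on it.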
