I implemented the following scikit-learn pipeline in a file called build.py and later pickled it successfully.
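import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Encoders assumed to come from category_encoders, judging by their arguments
from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder

# Encode the categorical columns; all other columns pass through unchanged
preprocessor = ColumnTransformer(transformers=[
    ('target', TargetEncoder(), COL_TO_TARGET),
    ('one_hot', OneHotEncoder(drop_invariant=False, handle_missing='value',
                              handle_unknown='value', return_df=True,
                              use_cat_names=True, verbose=0), COL_TO_DUM),
    ('construction', OrdinalEncoder(mapping=mapping), ['ConstructionPeriod'])
], remainder='passthrough')

test_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('std_scale', StandardScaler()),
    ('XGB_model', xgb.XGBRegressor(
        booster='gbtree', colsample_bylevel=0.75, colsample_bytree=0.75,
        max_depth=20, grow_policy='depthwise', learning_rate=0.1
    ))
])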
test_pipeline.fit(X_train, y_train)
import pickle

# pickle.dump takes the object first, then the open file
with open('final_pipeline.pkl', 'wb') as f:
    pickle.dump(test_pipeline, f)
The pickled pipeline is then loaded in a different file, app.py, which accepts user data and makes predictions with the unpickled pipeline.
import pickle
import pandas as pd

with open('final_pipeline.pkl', 'rb') as f:
    pipeline = pickle.load(f)

# data comes from the user via the frontend
input_df = pd.DataFrame(data.dict(), index=[0])
# use the pipeline to predict
prediction = pipeline.predict(input_df)
The problem I am running into is that the unpickled pipeline expects the incoming data to have the same column structure as the data used to train it (X_train).
To solve this, I need to reorder the columns of the incoming data to match those of X_train.
- Dirty solution: export the X_train column names to a file and later read it inside app.py to rearrange the columns of the incoming data.
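For reference, the workaround I have in mind would look roughly like the sketch below. The file name train_columns.json and both helper functions are hypothetical names made up for illustration, and the sketch assumes the user's record arrives as a plain dict (as data.dict() returns above).

```python
import json

COLUMNS_FILE = 'train_columns.json'  # hypothetical file name


def save_train_columns(columns, path=COLUMNS_FILE):
    """build.py side: store the training column order next to the pickle."""
    with open(path, 'w') as f:
        json.dump(list(columns), f)


def order_like_training(record, path=COLUMNS_FILE):
    """app.py side: return the record with its keys in training order.

    Keys that were not in the training data are dropped; a missing
    training column raises KeyError instead of being silently ignored.
    """
    with open(path) as f:
        train_columns = json.load(f)
    missing = [col for col in train_columns if col not in record]
    if missing:
        raise KeyError(f'missing columns: {missing}')
    return {col: record[col] for col in train_columns}
```

In build.py this would be called as save_train_columns(X_train.columns), and in app.py as input_df = pd.DataFrame(order_like_training(data.dict()), index=[0]). It works, but it means shipping and keeping in sync a second file alongside the pickle, which is why it feels dirty.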
Any suggestions on how to solve this in a more Pythonic way?