
I used the package pycaret to create a pipeline for my dataset. Now I want to apply this pipeline to my own train and test splits that I create from the data. OneHotEncoding is one of the steps in my pipeline. Because of this, the transformed X_train has a different number of columns than my X_test data, since the number of unique values in some columns differs between the splits.
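
To illustrate the mismatch with plain scikit-learn (a minimal sketch; the column and values are made up, and pycaret's internal encoder may behave differently):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
test = pd.DataFrame({'color': ['red', 'yellow']})  # 'yellow' never appears in train

# fitting the encoder separately on each split yields different column counts
enc = OneHotEncoder()
print(enc.fit_transform(train).shape)  # (3, 3): red, blue, green
print(enc.fit_transform(test).shape)   # (2, 2): red, yellow -> column mismatch

# fitting once on the training data and ignoring unknown categories keeps the columns fixed
enc = OneHotEncoder(handle_unknown='ignore').fit(train)
print(enc.transform(test).shape)       # (2, 3): 'yellow' is encoded as all zeros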

To circumvent this issue, I had the idea to first fit the transform-related steps of my pipeline on all my data, and to fit only the predictor (GradientBoostingClassifier) on the training data. To accomplish this, I have the two setups described below.
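
The slicing used in the code below relies on scikit-learn's Pipeline indexing; a minimal sketch with toy steps (the step names here are made up, not my actual pycaret pipeline):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

pipe = Pipeline([
    ('encode', OneHotEncoder(handle_unknown='ignore')),
    ('model', GradientBoostingClassifier(random_state=0)),
])

transform_steps = pipe[:-1]  # a Pipeline of every step except the last
final_model = pipe[-1]       # the final estimator itself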

Setup 1: Here we first fit and transform all our data. From this transformed data we can now pull our train and test data and fit the predictor. [diagram of setup 1 omitted]

Setup 2: Here we first only fit the pipeline on all data. The train and test splits are taken directly from the original data and are then transformed using the fitted pipeline. [diagram of setup 2 omitted]

Both of these setups seem to work (as in, they raise no errors). However, they produce different outputs, even with the same seed. This is a major cause for concern, since I assumed the results would be exactly the same. Below is the output of both setups.

Setup 1:
accuracy  : 0.732
Time taken: 171

Setup 2:
accuracy  : 0.805
Time taken: 184

I have 2 questions about this problem.

  1. Are these setups appropriate to deal with such a problem, or are there any other best practices that I am missing?
  2. What could be the reason that the accuracy scores are different? They should be identical, right?

Code for setup 1:

target = 'won'

predictor = self.load_predictor(1)
predictor_pipeline = predictor[:-1]  # every step except the final estimator
predictor_model = predictor[-1]      # the GradientBoostingClassifier

start = perf_counter()

print('Fitting pipeline')
X, y = dataset[config.INPUT_FEATURES + [target]], dataset[target]
X_transformed = predictor_pipeline.fit_transform(X, y)

train_set, test_set = create_sets()

train_df = tools.keep_subset(dataset, 'index', train_set)
test_df = tools.keep_subset(dataset, 'index', test_set)

# positional indices into the transformed frame
# (assumes the dataset has a default RangeIndex; otherwise .iloc would pick the wrong rows)
train_indices = train_df.index
test_indices = test_df.index

print('Adding weights')
train_df['sample_weights'] = self.add_sample_weights(train_df)
test_df['sample_weights'] = self.add_sample_weights(test_df)

X_train, y_train, weights_train = X_transformed.iloc[train_indices], train_df[target], train_df['sample_weights']
X_test, y_test, weights_test = X_transformed.iloc[test_indices], test_df[target], test_df['sample_weights']

print('Fitting predictor')
predictor_model.fit(X_train, y_train, sample_weight=weights_train)
preds = predictor_model.predict(X_test)

print(accuracy := accuracy_score(y_test, preds, sample_weight=weights_test))
print(f'Time taken: {perf_counter() - start}')

Code for setup 2:

target = 'won'

predictor = self.load_predictor(1)
predictor_pipeline = predictor[:-1]
predictor_model = predictor[-1]

start = perf_counter()

print('Fitting processor ..')
X, y = dataset[config.INPUT_FEATURES + [target]], dataset[target]
predictor_pipeline.fit(X, y)

train_set, test_set = create_sets()

train_df = tools.keep_subset(dataset, 'index', train_set)
test_df = tools.keep_subset(dataset, 'index', test_set)

print('Adding weights')
train_df['sample_weights'] = self.add_sample_weights(train_df)
test_df['sample_weights'] = self.add_sample_weights(test_df)

X_train, y_train, weights_train = train_df[config.INPUT_FEATURES + [target]], train_df[target], train_df['sample_weights']
X_test, y_test, weights_test = test_df[config.INPUT_FEATURES + [target]], test_df[target], test_df['sample_weights']

print('Processing training and test data ..')
X_train = predictor_pipeline.transform(X_train)
X_test = predictor_pipeline.transform(X_test)

print('Fitting model ...')
predictor_model.fit(X_train, y_train, sample_weight=weights_train)
preds = predictor_model.predict(X_test)

print(accuracy_score(y_test, preds, sample_weight=weights_test))
print(f'Time taken: {perf_counter() - start}')
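
To narrow down question 2, one check I can run is comparing the transformed training data from both setups directly; a sketch, assuming the variables from both snippets are available in one session and the transformed output is fully numeric:

import numpy as np

# setup 1's training matrix: row-sliced from the transform of all data
setup1_train = np.asarray(X_transformed.iloc[train_indices], dtype=float)

# setup 2's training matrix: the fitted pipeline applied to the raw train rows
setup2_train = np.asarray(
    predictor_pipeline.transform(train_df[config.INPUT_FEATURES + [target]]),
    dtype=float,
)

print(setup1_train.shape, setup2_train.shape)   # shapes should match
print(np.allclose(setup1_train, setup2_train))  # False would explain the gap

If these differ, a random element in one of the transformers or a row misalignment would be the likely cause.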
  • A minimal reproducible example would really help out here. The two methods should have the same result; the culprit could be your `create_sets` or `tools.keep_subset` not producing the same split in the two runs, one of your pipeline steps not working correctly, a random element in the transformers or model, or something else. (Also, you generally shouldn't fit transformers to the combined train/test set, but that's a subject for stats.SE or datascience.SE.) – Ben Reiniger Mar 17 '22 at 19:52
  • Thanks for the answer! My suspicion is that it is a random element in one of the transformers. I'll dive deeper into the exact steps it is taking. – Jeroen Vermunt Mar 20 '22 at 17:54
  • Are you sure you have set `random_state=0` within the model that has been chained into your pipeline, e.g. `Pipeline(steps=[('SGD', SGDRegressor(random_state=0))])`, and wherever you split the data into train and test, e.g. `train_test_split(X, y, ..., random_state=0)`, to get stable/reproducible results? Check this [post](https://stackoverflow.com/q/74042553/10452700) and its comments, which might help. – Mario Jul 07 '23 at 17:47
