I used the package pycaret to create a pipeline for my dataset. Now I want to apply this pipeline to my own train and test splits of the data. OneHotEncoding is one of the steps in my pipeline. Because of this, the transformed X_train has a different number of columns than my X_test data, since some columns have a different set of unique values in each split.
To circumvent this issue, my idea was to fit the transform-related steps of my pipeline on all of the data, and fit only the predictor (GradientBoostingClassifier) on the training data. To accomplish this, I have the two setups described below.
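To illustrate the column mismatch (toy data, with sklearn's OneHotEncoder standing in for the pycaret step):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: 'city' has a category ('c') that only appears in the "train" rows.
train = pd.DataFrame({"city": ["a", "b", "c"]})
test = pd.DataFrame({"city": ["a", "b"]})

# Encoding each split independently yields different column counts.
cols_train = pd.get_dummies(train).shape[1]  # 3 columns
cols_test = pd.get_dummies(test).shape[1]    # 2 columns

# Fitting one encoder and reusing it keeps the column count fixed;
# handle_unknown="ignore" also tolerates categories unseen at fit time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)
assert enc.transform(train).shape[1] == enc.transform(test).shape[1] == 3
```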
Setup 1: Here we first fit and transform all our data. From this transformed data we pull our train and test splits and fit the predictor on the training split.
Setup 2: Here we first only fit the pipeline on all data. The train and test splits are taken directly from the original data and are then transformed using the fitted pipeline.
Both of these setups seem to work (as in, they raise no error). However, they produce different output (with the same seed). This is a major cause for concern, since I assumed the results would be identical. Below is the output of both setups.
accuracy : 0.732
Time taken: 171
accuracy : 0.805
Time taken: 184
I have 2 questions about this problem.
- Are these setups appropriate to deal with such a problem, or are there any other best practices that I am missing?
- What could be the reason that the accuracy scores are different? They should be identical, right?
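For reference, here is the sanity check I had in mind: for most sklearn transformers, fit_transform(X) equals fit(X) followed by transform(X), so slicing the transformed data (setup 1) and transforming raw slices (setup 2) should agree when the row order and indices line up. A toy sketch with StandardScaler standing in for my pipeline:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
train_idx = np.arange(7)  # first 7 rows as a stand-in "train split"

# Setup 1 style: fit_transform on ALL data, then slice the result.
X_all_transformed = StandardScaler().fit_transform(X)
X_train_1 = X_all_transformed[train_idx]

# Setup 2 style: fit on ALL data, then transform the raw split.
scaler = StandardScaler().fit(X)
X_train_2 = scaler.transform(X[train_idx])

# For a deterministic transformer these match exactly.
assert np.allclose(X_train_1, X_train_2)
```

I do know a few transformers intentionally break this equivalence (e.g. scikit-learn's TargetEncoder cross-fits inside fit_transform, so fit_transform differs from fit + transform by design), but I don't know whether my pycaret pipeline contains such a step.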
Code for setup 1:
predictor = self.load_predictor(1)
predictor_pipeline = predictor[:-1]  # all transform steps
predictor_model = predictor[-1]      # the GradientBoostingClassifier
start = perf_counter()
print('Fitting pipeline')
X, y = dataset[config.INPUT_FEATURES + [target]], dataset[target]
# fit the transform steps on ALL data, then transform it
X_transformed = predictor_pipeline.fit_transform(X, y)
train_set, test_set = create_sets()
train_df = tools.keep_subset(dataset, 'index', train_set)
test_df = tools.keep_subset(dataset, 'index', test_set)
train_indices = train_df.index
test_indices = test_df.index
print('Adding weights')
train_df['sample_weights'] = self.add_sample_weights(train_df)
test_df['sample_weights'] = self.add_sample_weights(test_df)
# note: .iloc is positional; feeding it index labels assumes a default RangeIndex
X_train, y_train, weights_train = X_transformed.iloc[train_indices], train_df[target], train_df['sample_weights']
X_test, y_test, weights_test = X_transformed.iloc[test_indices], test_df[target], test_df['sample_weights']
print('Fitting predictor')
predictor_model.fit(X_train, y_train, sample_weight=weights_train)
preds = predictor_model.predict(X_test)
print(accuracy_score(y_test, preds, sample_weight=weights_test))
print(f'Time taken: {perf_counter() - start}')
Code for setup 2:
target = 'won'
predictor = self.load_predictor(1)
predictor_pipeline = predictor[:-1]
predictor_model = predictor[-1]
start = perf_counter()
print('Fitting processor ..')
X, y = dataset[config.INPUT_FEATURES + [target]], dataset[target]
predictor_pipeline.fit(X, y)
train_set, test_set = create_sets()
train_df = tools.keep_subset(dataset, 'index', train_set)
test_df = tools.keep_subset(dataset, 'index', test_set)
print('Adding weights')
train_df['sample_weights'] = self.add_sample_weights(train_df)
test_df['sample_weights'] = self.add_sample_weights(test_df)
X_train, y_train, weights_train = train_df[config.INPUT_FEATURES + [target]], train_df[target], train_df['sample_weights']
X_test, y_test, weights_test = test_df[config.INPUT_FEATURES + [target]], test_df[target], test_df['sample_weights']
print('Processing training and test data ..')
X_train = predictor_pipeline.transform(X_train)
X_test = predictor_pipeline.transform(X_test)
print('Fitting model ...')
predictor_model.fit(X_train,y_train,sample_weight=weights_train)
preds = predictor_model.predict(X_test)
print(accuracy_score(y_test, preds, sample_weight=weights_test))
print(f'Time taken: {perf_counter() - start}')
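One detail I want to double-check, since it only affects setup 1: there I slice X_transformed with .iloc using values taken from .index. That is only safe when the frame has a default RangeIndex, because .iloc is positional while .index holds labels. A toy illustration (hypothetical frame, not my data):

```python
import pandas as pd

# A frame whose index is NOT the default 0..n-1 RangeIndex.
df = pd.DataFrame({"x": [10, 20, 30]}, index=[2, 0, 1])
labels = list(df.index[:2])  # index *labels*: [2, 0]

by_position = df.iloc[labels]["x"].tolist()  # rows at positions 2 and 0
by_label = df.loc[labels]["x"].tolist()      # rows with labels 2 and 0

# The two selections pick different rows.
assert by_position != by_label
```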