In an ML project you first split off your train and test data sets, and then fit all your transformations on the train data set only, to make sure information leakage doesn't take place. To be more precise:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)
Once you've done the above, you carry out all your oversampling, undersampling, scaling and dimensionality reduction using X_train and y_train only.
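For instance, this is how I handle the oversampling part on the training split only (a sketch with synthetic data; I use sklearn.utils.resample here for simple random oversampling, though SMOTE from imbalanced-learn would be fitted on the training split in the same way):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the real set
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

# Oversample the minority class in the TRAINING split only,
# so the test set plays no part in the resampling
minority = X_train[y_train == 1]
upsampled = resample(minority, replace=True,
                     n_samples=int((y_train == 0).sum()), random_state=111)

X_train_bal = np.vstack([X_train[y_train == 0], upsampled])
y_train_bal = np.concatenate([np.zeros((y_train == 0).sum(), dtype=int),
                              np.ones(len(upsampled), dtype=int)])
```

X_test and y_test stay untouched throughout.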
For example:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
pca = PCA(n_components = 0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
I was under the impression that creating the PCA object (from sklearn.decomposition import PCA), first calling it on the train data set:
pca = PCA(n_components = 0.95)
pca.fit_transform(X_train)
and then calling transform on the test set:
pca.transform(X_test)
would (or should) give me the same dimensions the model was trained on. Unfortunately, when I try to calculate the test error on the model's unseen data, I get the following:
X has 39 features, but DecisionTreeClassifier is expecting 10 features as input
This is really puzzling to me, because using the same PCA object should transform X_test into exactly the same number of dimensions. What am I missing here?
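To make my expectation concrete, here is a minimal synthetic sketch (random data standing in for my real 39-feature set) where the same scaler and PCA objects give matching train/test widths when the transformed results are assigned back:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(111)
X = rng.normal(size=(200, 39))  # 39 features, like my real data

X_train, X_test = train_test_split(X, test_size=0.2, random_state=111)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on train only
X_test = sc.transform(X_test)

pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)  # result assigned back to X_train
X_test = pca.transform(X_test)        # result assigned back to X_test

print(X_train.shape[1], X_test.shape[1])  # widths match
```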
This is how the test error has been calculated:
y_pred = tree_model.predict(X_test)               # hard class labels
y_proba = tree_model.predict_proba(X_test)[:, 1]  # positive-class probabilities
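For completeness, here is that scoring step in isolation (a self-contained sketch on synthetic data; accuracy on the hard labels and ROC AUC on the probabilities are stand-in metrics, not necessarily the ones in my real pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

tree_model = DecisionTreeClassifier(random_state=111).fit(X_train, y_train)

y_pred = tree_model.predict(X_test)               # hard labels -> accuracy
y_proba = tree_model.predict_proba(X_test)[:, 1]  # P(class 1)  -> ROC AUC

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
print(acc, auc)
```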