
In an ML project, you first split out your train and test data sets, and then fit all of your transformations on the train data set only, to make sure information leakage doesn't take place. To be more precise:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

Once you have done the above, you carry out all of your oversampling/undersampling, scaling, and dimensionality reduction using X_train and y_train only.
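Of these, the resampling happens first and only on the training split. I haven't reproduced my exact resampling code here, but it is roughly along these lines (imblearn's RandomOverSampler is just a stand-in for whatever sampler is actually used):

from imblearn.over_sampling import RandomOverSampler

# Oversample the minority class using the training split only,
# so nothing from the test split leaks into the fitting steps below
ros = RandomOverSampler(random_state=111)
X_train, y_train = ros.fit_resample(X_train, y_train)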

The scaling and dimensionality reduction are then fitted on the training split and only applied to the test split, for example:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fit the scaler on the training split only, then apply it to both splits
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit PCA on the (scaled) training split only, keeping 95% of the variance
pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

I was under the impression that creating the PCA object (from sklearn.decomposition import PCA) and first calling fit_transform on the train data set:

pca = PCA(n_components=0.95)
X_train = pca.fit_transform(X_train)

and then calling transform on the test set:

X_test = pca.transform(X_test)

would (or should) give me back the same number of dimensions the model was trained on. But unfortunately, when I try to calculate the test error on data the model hasn't seen, I get the following:

X has 39 features, but DecisionTreeClassifier is expecting 10 features as input

This is really puzzling to me, because using the same fitted PCA object should transform X_test into exactly the same number of dimensions. What am I missing here?
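To illustrate what I expected, these are the kinds of checks I assumed would line up after the PCA step (a hypothetical snippet, just reusing the names from above):

# Number of components kept for 95% variance (going by the error message, this is 10)
print(pca.n_components_)
# Both transformed matrices should end up with that same number of columns
print(X_train.shape, X_test.shape)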

This is how the test error is being calculated:

# Class predictions and positive-class probabilities from the fitted tree
y_pred = tree_model.predict(X_test)
y_prob = tree_model.predict_proba(X_test)[:, 1]
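For reference, this is roughly the full flow I'm expecting to work, simplified onto a toy dataset (my real code uses different variable names, my own data rather than make_classification, and also includes the oversampling step mentioned above):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for my data: 39 raw features, as in the error message
X, y = make_classification(n_samples=1000, n_features=39, n_informative=10, random_state=111)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=111)

# Fit the scaler and PCA on the training split only, reuse them on the test split
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train)
X_test_sc = sc.transform(X_test)

pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train_sc)
X_test_pca = pca.transform(X_test_sc)

# Train on the PCA-transformed training data, predict on the PCA-transformed test data
tree_model = DecisionTreeClassifier(random_state=111)
tree_model.fit(X_train_pca, y_train)

# All three of these should report the same feature count
print(X_train_pca.shape[1], X_test_pca.shape[1], tree_model.n_features_in_)
y_pred = tree_model.predict(X_test_pca)
y_prob = tree_model.predict_proba(X_test_pca)[:, 1]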
  • What exactly is `tree_model`? Did you fit `tree_model` on the `X_train` obtained from the `PCA`? – Onyambu May 13 '21 at 02:15
  • It's a `DecisionTreeClassifier` (`from sklearn.tree import DecisionTreeClassifier`) – add-semi-colons May 13 '21 at 02:18
  • There should be no error, since both X_train and X_test have been transformed via PCA. The only advice I can give is to use different variable names, i.e. `X_train_pca = pca.fit_transform(X_train)` and `X_test_pca = pca.transform(X_test)`, then use `tree_model.fit(X_train_pca, Y_train)` and `tree_model.predict(X_test_pca)` – Onyambu May 13 '21 at 02:28
  • @Onyambu I am actually using different variables; I just didn't carry that over here. But something really strange is happening. I feel like the pca object is not retaining the information and is therefore just giving back all the features. – add-semi-colons May 13 '21 at 02:36
  • That's weird. I am quite sure that the X_train passed to the tree model and the X_test have the same dimensions after the PCA transformation, so I cannot tell why it is giving the error. – Onyambu May 13 '21 at 02:38
  • Yeah, I am going to break apart the code and try a simple data set. The only other thing I can think of is that I have an oversampling step on X_train prior to running the scaling and PCA, but that really shouldn't have any impact, because that's just adding more data points. – add-semi-colons May 13 '21 at 13:24

0 Answers