
For my work, I have split the data, then applied oversampling (because of the imbalanced class distribution) and feature selection. I want to use the XGBoost classifier, but I get the following error.

ValueError                                Traceback (most recent call last)
<ipython-input-16-ace98cb7898f> in <module>()
      5 model.fit(X_train, y_train)
      6 # make predictions for test data
----> 7 y_pred = model.predict(X_test)
      8 predictions = [round(value) for value in y_pred]
      9 # evaluate predictions

2 frames
/usr/local/lib/python3.7/dist-packages/xgboost/core.py in _validate_features(self, data)
   1688 
   1689                 raise ValueError(msg.format(self.feature_names,
-> 1690                                             data.feature_names))
   1691 
   1692     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch.

Below is the code:

X_train, X_test, y_train, y_test = train_test_split(
     features, label, test_size=0.50, random_state=42)

oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
estimator = LogisticRegression()

selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X_train, y_train)

model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

How can I solve the error knowing that oversampling and feature selection always happen after splitting the data?

Prakash Dahal

1 Answer


You applied the feature selector to the training data only, which is the reason for the feature mismatch: the model is trained on the selected features, but at prediction time `X_test` still contains all of the original features. Fit the selector on the training data, then use the same fitted instance to transform the test data as well.

selector = RFE(estimator, n_features_to_select=5, step=1)
X_train = selector.fit_transform(X_train, y_train)  # RFE needs y to fit
X_test = selector.transform(X_test)                 # reuse the same fitted selector
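A minimal end-to-end sketch of this fix, using a synthetic dataset. To keep the sketch dependency-light, `LogisticRegression` stands in for `XGBClassifier` and the SMOTE step is omitted; the point being demonstrated is only the fit-on-train / transform-both pattern, which applies unchanged to the original pipeline:

```python
# Sketch: fit the selector on training data only, then transform BOTH
# train and test with the same fitted instance so feature sets match.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the original data
features, label = make_classification(n_samples=200, n_features=10,
                                      random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    features, label, test_size=0.50, random_state=42)

selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=5, step=1)
X_train_sel = selector.fit_transform(X_train, y_train)  # fit on train only
X_test_sel = selector.transform(X_test)                 # same fitted selector

# Train and test now agree on the 5 selected features,
# so predict() no longer raises a feature mismatch.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_sel, y_train)
y_pred = model.predict(X_test_sel)
print(X_train_sel.shape[1], X_test_sel.shape[1])  # both 5
```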
Prakash Dahal