For my work, I split the data and then applied oversampling (because the class distribution is imbalanced) and feature selection. I want to use the XGBoost classifier, but I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-16-ace98cb7898f> in <module>()
5 model.fit(X_train, y_train)
6 # make predictions for test data
----> 7 y_pred = model.predict(X_test)
8 predictions = [round(value) for value in y_pred]
9 # evaluate predictions
2 frames
/usr/local/lib/python3.7/dist-packages/xgboost/core.py in _validate_features(self, data)
1688
1689 raise ValueError(msg.format(self.feature_names,
-> 1690 data.feature_names))
1691
1692 def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):
ValueError: feature_names mismatch.
Below is the code:
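from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# features and label are loaded earlier (not shown)
X_train, X_test, y_train, y_test = train_test_split(
    features, label, test_size=0.50, random_state=42)
# oversample the training split only
oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)
# feature selection: the selector is fitted, but its result is not applied anywhere below
estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5, step=1)
selector = selector.fit(X_train, y_train)
# train XGBoost and predict on the test split (the error is raised by predict)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)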
How can I fix this error, given that oversampling and feature selection must always happen after splitting the data?