I am working with an XGBoost model in Python on a large dataset of embeddings (X) and corresponding labels (y), about 30,000 samples in total. The data is very imbalanced, with 8 different label classes. I am trying to tune hyperparameters with RandomizedSearchCV, but for some of the CV folds I get an error:
ValueError: Invalid classes inferred from unique values of `y`. Expected: [0 1 2 3 4 5 6 7], got [0 1 2 3 5 6 7 8].
Because the data is split differently for each fold (I use a stratified group split), some folds do not contain every label in both the training part and the validation part.
I searched the web extensively and couldn't find anything about this exact situation, even though I imagine it is a common issue in imbalanced multiclass classification problems.
My code:
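from sklearn.model_selection import RandomizedSearchCV, StratifiedGroupKFold
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode the labels as consecutive integers starting at 0
y = y.values.astype(int)
le = LabelEncoder()
y = le.fit_transform(y)

xgb_base = XGBClassifier(objective='multi:softprob', learning_rate=LR)
cv = StratifiedGroupKFold(n_splits=NUM_CV)

# Create the randomized search over the XGBoost hyperparameters
xgb_random = RandomizedSearchCV(estimator=xgb_base, param_distributions=xgb_grid,
                                n_iter=NUM_ITER, cv=cv, verbose=2,
                                random_state=1)

# Fit the random search model (groups are forwarded to the CV splitter)
xgb_random.fit(X, y, groups=groups)

# Print the optimal parameters
print(xgb_random.best_params_)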