
I am working with an XGBoost model in Python, on a large dataset of embeddings (X) and corresponding labels (y); I have about 30000 samples. The data is very imbalanced, with 8 different classes of labels. I am attempting to perform hyperparameter tuning (using RandomizedSearchCV). For some of the CV folds I get an error:

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2 3 4 5 6 7], got [0 1 2 3 5 6 7 8].

Because the split is different each time (I am using a stratified group split), some splits do not end up with all the labels in both the training and validation sets.

I searched the web a lot and couldn't find anything in this exact context, even though I imagine this should be a major issue for many imbalanced multiclass classifications.

My code:

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RandomizedSearchCV, StratifiedGroupKFold
from xgboost import XGBClassifier

y = y.values.astype(int)
le = LabelEncoder()
y = le.fit_transform(y)

xgb_base = XGBClassifier(objective='multi:softprob', learning_rate=LR)

cv = StratifiedGroupKFold(n_splits=NUM_CV)
# Create the randomized search over the XGBoost hyperparameter grid
xgb_random = RandomizedSearchCV(estimator=xgb_base, param_distributions=xgb_grid,
                                n_iter=NUM_ITER, cv=cv, verbose=2,
                                random_state=1)
# Fit the random search model
xgb_random.fit(X, y, groups=groups)

# Get the optimal parameters
print(xgb_random.best_params_)
  • What are the counts of each class? Are the groups necessary? What are the counts of each class within those groups? – Ben Reiniger Jun 21 '22 at 03:46

1 Answer


This is not a bug or error. Use StratifiedKFold and see if that helps.

Why this is happening: suppose you have 3 classes and 5 samples, as [0,1,0,1,2]. Even if you split into only 2 folds, i.e. k=2, either the train or the test set won't have class == 2. This is what is happening in your case.
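The toy split above can be reproduced directly. This is a minimal sketch using plain KFold (with only one sample of class 2, even StratifiedKFold cannot put it in every fold):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(5).reshape(-1, 1)   # 5 samples
y = np.array([0, 1, 0, 1, 2])     # class 2 appears only once

# With k=2, one of the two folds necessarily lacks class 2 in training.
for train_idx, test_idx in KFold(n_splits=2).split(X, y):
    print("train classes:", np.unique(y[train_idx]),
          "test classes:", np.unique(y[test_idx]))
```

Fitting XGBClassifier on the fold whose training labels are only {0, 1} triggers exactly the ValueError from the question.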

If K > the minimum number of samples per class, you'll definitely have this problem. If not, then StratifiedKFold can help: it splits the data so that each fold has almost the same class distribution as the whole set.

On a broader note, if you can, drop the classes that are not required, or try merging two or more classes.
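A rough sketch of the "drop the rare classes" suggestion, assuming a NUM_CV like the question's and using stand-in arrays in place of the real embeddings and labels: keep only classes that appear at least NUM_CV times, then re-encode so the surviving labels are contiguous (which is what XGBClassifier expects).

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

NUM_CV = 5  # same constant as in the question

# Stand-in imbalanced labels and embeddings (the real y/X come from the data).
y = np.array([0]*30 + [1]*25 + [2]*20 + [3]*12 + [4]*8 + [5]*3 + [6]*2)
X = np.arange(len(y) * 2, dtype=float).reshape(len(y), 2)

# Keep only classes with at least NUM_CV samples, so every stratified
# fold can contain at least one sample of each remaining class.
classes, counts = np.unique(y, return_counts=True)
keep = np.isin(y, classes[counts >= NUM_CV])
X_kept, y_kept = X[keep], y[keep]

# Re-encode so the surviving classes form a contiguous 0..n-1 range.
y_kept = LabelEncoder().fit_transform(y_kept)
```

Here classes 5 and 6 (3 and 2 samples) are dropped and the remaining five classes are re-labelled 0 through 4.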

Check this link to see the difference between different KFold types

  • Lemme know if that helped. – Deshwal Jun 20 '22 at 14:05
  • Thanks for the quick response! I am using StratifiedCV, specifically StratifiedGroupKFold. I understand that the data will not split in a way that I have all classes in both train and validation, but I can't figure out why I get this specific error – orly064 Jun 20 '22 at 14:13
  • What is the error? Can you describe the traceback? – Deshwal Jun 20 '22 at 14:14
  • Because the mappings are all weird. That is why there's this error. For example, during training it encoded the class values `[2,3,4,5,6] -> [0,1,2,3,4]`, but during CV time it got `[0,3,4,5,6,7,8]`. How is the model supposed to handle or compare those values? – Deshwal Jun 20 '22 at 14:17
  • Ok, I guess I understand what you are saying... but do you know how can I solve this? I will not have all labels in both training and validation... – orly064 Jun 20 '22 at 15:36
  • @orly064 What is the purpose of that kind of training when you can't either train or test on the **whole**? Why do you feel that this kind of model would be any good? If you're comfortable with a label not being in training, why not use the suggestion and drop the whole class from the data? – Deshwal Jun 21 '22 at 05:37
  • Hi @Deshwal, thanks a lot for the discussion! Indeed this may be the solution... I was probably too deep in trying to solve the error that I didn't consider just dropping the specific labels. – orly064 Jun 21 '22 at 06:13
  • Glad I could help. We all reach this stage once in a while no matter how experienced we are!!!! – Deshwal Jun 21 '22 at 07:22