
I want to train an XGBoost binary classifier. My labeled training data is in a text file in libsvm format. I am working with an extremely imbalanced dataset: roughly 200 examples of one class and 66,000 of the other. Because of that, an advisor told me to stay away from the standard train/test split and instead do "some k-fold CV". This confused me, since I've only ever used k-fold CV to estimate model performance at the end, and I don't understand how to use it to replace the train/test split. I tried xgb.cv and cross_val_score, but I want a model I can predict with, and (unless I am misunderstanding) those don't output models that I could use to predict a label for a new point. Should I do the k-fold training manually? I am not even sure what to look for here. I was also told not to attempt any class balancing on this data, as we need a baseline. I feel like this is simple, but seeing some code for it would really help. Thanks in advance!

Here's what I have so far, but it only gives me a score, not a model I can use to predict. I also have another version of the code that uses a DMatrix, but it's essentially the same thing.

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
# load the libsvm-format training data
X, y = load_svmlight_file(file_path)
# define model
model = XGBClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
mean_accuracy = scores.mean()
print(f'Mean Accuracy: {mean_accuracy}')
sshen

2 Answers


I'm not sure if you got incorrect advice or misinterpreted your advisor, but cross-validation is only for evaluating model performance; on its own it does nothing to handle an imbalanced dataset. To address the imbalance, you need to either up/downsample the data or update your loss function to weight the minority class accordingly.

In the case of xgboost, the parameter "scale_pos_weight" might be of interest to you. You can also look at this blog post. Hopefully this helps!
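As a sketch of that idea (assuming 0/1 labels and reusing file_path from the question), a common heuristic is to set scale_pos_weight to the ratio of negative to positive examples, which with ~200 positives and ~66,000 negatives comes out to roughly 330:

from sklearn.datasets import load_svmlight_file
from xgboost import XGBClassifier

X, y = load_svmlight_file(file_path)

# weight the positive class by the negative/positive ratio
n_pos = (y == 1).sum()
n_neg = (y == 0).sum()
model = XGBClassifier(scale_pos_weight=n_neg / n_pos)
model.fit(X, y)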

Suraj Shourie

Here's what I have so far, but it only gives me a score, not a model I can use to predict.

cross_val_score() can't do this. You need the slightly more flexible version, cross_validate(), which has a return_estimator parameter that makes it return the fitted models. It returns a dictionary like this:

{'test_score': array([1.  , 1.  , 0.95, 0.95, 0.95, 1.  , 1.  , 1.  , 0.9 , 1.  , 0.95,
        1.  , 0.95, 1.  , 1.  ]),
 'estimator': [...],
 ...
 }

You can get the scores using result['test_score'], and the estimators with result['estimator'].

It's worth noting that your question is ambiguous: you want a single model to predict with, but cross-validation fits multiple models, one per fold, and makes multiple predictions. In this example, I pick the model that scored best on its held-out fold.

import numpy as np
from sklearn.model_selection import cross_validate

result = cross_validate(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, return_estimator=True)
best_estimator_idx = np.argmax(result['test_score'])
model = result['estimator'][best_estimator_idx]

You could also re-train a new model on the entire set of data.
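As a minimal sketch of that approach (reusing X and y from above; X_new stands in for hypothetical new points you want to label), the cross-validation scores serve only as your performance estimate, and the model you actually predict with is fit on everything:

from xgboost import XGBClassifier

# the CV above estimates how well this model family generalizes;
# the final model used for prediction is fit on all available data
final_model = XGBClassifier()
final_model.fit(X, y)
predictions = final_model.predict(X_new)  # X_new: hypothetical new data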

Nick ODell