I want to train an XGBoost binary classifier. My labeled training data is in a txt file in libsvm format. The dataset is extremely imbalanced: roughly 200 examples of one class and 66,000 of the other. Because of that, an advisor told me to stay away from the standard train/test split and instead do "some k-fold CV". This confused me, since I've only ever used k-fold CV to evaluate model performance at the end, and I don't understand how to use it to replace a train/test split. I tried xgb.cv and cross_val_score, but I want a model I can predict with, and (unless I'm misunderstanding) those don't output a model I could use to predict a label for a new point. Could someone help me? I feel like this is simple, but seeing some code for it would really help. Should I do the k-fold training manually? I'm not even sure what to look for here. I was also told not to do any class balancing on this data, since we need a baseline.
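To make the "manually" part concrete, here is a rough sketch of the kind of loop I have in mind (I'm guessing at StratifiedKFold so each fold keeps the 200/66,000 class ratio, file_path is a placeholder, and I have no idea if this is actually the right approach):

from sklearn.datasets import load_svmlight_file
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

X, y = load_svmlight_file(file_path)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
models, scores = [], []
for train_idx, test_idx in skf.split(X, y):
    # train a fresh model on this fold's training portion
    model = XGBClassifier()
    model.fit(X[train_idx], y[train_idx])
    # score it on the held-out portion
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    models.append(model)

print('per-fold accuracy:', scores)
# ...but now I have 5 fitted models -- which one do I predict with?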
Here's what I have so far, but it just gives me a score and not a model I can use to predict. I also have another version of the code that uses a DMatrix, but it's essentially the same thing (sketched at the bottom of this post).
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
# load the libsvm-format training data
X, y = load_svmlight_file(file_path)
# define model
model = XGBClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
# evaluate model: this reports a score but never hands back a fitted model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
mean_accuracy = scores.mean()
print(f'Mean Accuracy: {mean_accuracy}')
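For completeness, the DMatrix version is roughly this (same problem: xgb.cv gives back a table of per-round metrics, not a fitted booster I can call predict on):

import xgboost as xgb
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file(file_path)
dtrain = xgb.DMatrix(X, label=y)

# 'error' is classification error, i.e. 1 - accuracy
params = {'objective': 'binary:logistic', 'eval_metric': 'error'}
# stratified=True so every fold keeps the 200/66,000 class ratio
results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                 stratified=True, seed=1)
print(results.tail())  # a DataFrame of train/test metrics, no model anywhere

Thanks in advance!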