Good question. XGBoost is known to do well on imbalanced datasets, and it exposes a few hyperparameters to help with exactly that.
For the scale_pos_weight parameter, the XGBoost documentation suggests:
sum(negative instances) / sum(positive instances)
For extremely unbalanced datasets, some have suggested using the square root of that ratio instead.
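As a minimal sketch (the 9800/200 label split below is just a stand-in, with 1 as the positive class and 0 as the negative class), both values can be computed like this:
from collections import Counter
from math import sqrt
# Stand-in binary label vector: 9800 positives, 200 negatives
y = [1] * 9800 + [0] * 200
counts = Counter(y)
ratio = counts[0] / counts[1]            # sum(negative instances) / sum(positive instances)
scale_pos_weight_plain = ratio           # value suggested by the XGBoost documentation
scale_pos_weight_sqrt = sqrt(ratio)      # damped variant sometimes suggested for extreme imbalance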
For per-example weights, passed through the sample_weight argument of XGBoost's fit() method, you can compute balanced class weights with a scikit-learn utility, as described here.
The difference between the two is explored here, but in summary:
The sample_weight parameter allows you to specify a different weight
for each training example. The scale_pos_weight parameter lets you
provide a weight for an entire class of examples ("positive" class).
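To make that distinction concrete, here is a minimal sketch on a small stand-in dataset (the 90/10 class split and the resulting scale_pos_weight of 9.0 are purely illustrative). Note that sample_weight is an argument to fit(), while scale_pos_weight is a constructor parameter:
from sklearn.datasets import make_classification
from sklearn.utils import class_weight
import xgboost
# Small stand-in dataset, roughly 90% negatives / 10% positives
X_train, y_train = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
# Option 1: one weight per training example, passed to fit()
w = class_weight.compute_sample_weight(class_weight='balanced', y=y_train)
per_sample = xgboost.XGBClassifier().fit(X_train, y_train, sample_weight=w)
# Option 2: a single weight for the whole positive class, passed to the constructor
per_class = xgboost.XGBClassifier(scale_pos_weight=9.0).fit(X_train, y_train)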
You can see these approaches implemented in the code below, including the square-root variant. Please note that I had to use synthetic data, since none was provided in the question.
# General imports
import pandas as pd
from collections import Counter
# Generate datasets
from sklearn.datasets import make_classification
from imblearn.datasets import make_imbalance
# Train/test split
from sklearn.model_selection import train_test_split
# Class weights
from sklearn.utils import class_weight
# Performance
from sklearn.metrics import classification_report
# Modeling
import xgboost
import warnings
warnings.filterwarnings('ignore')
# Generate synthetic data
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           class_sep=2.0, n_classes=2, n_clusters_per_class=5,
                           hypercube=True, random_state=30)
scaled_X, scaled_y = make_imbalance(X, y, sampling_strategy={0:200}, random_state=8)
data = pd.DataFrame(data=scaled_X, columns=['feature_{}'.format(i) for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(data, scaled_y, random_state=8, stratify=scaled_y)
# Compare 3 XGBoost models: no changes to weights, using sample weights, and using scale_pos_weight
# Build a model without using the scale_pos_weight parameter, fit it, and get a set of its performance measures.
model_no_scale = xgboost.XGBClassifier(random_state=30)
model_no_scale.fit(X_train, y_train)
# Print performance
print("Off the Shelf XGBoost")
print(classification_report(y_test, model_no_scale.predict(X_test)))
# Compute balanced per-example weights from the training labels and pass them to fit()
# https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost
model_weights = xgboost.XGBClassifier(random_state=30)
model_weights.fit(X_train, y_train,
                  sample_weight=class_weight.compute_sample_weight(class_weight='balanced', y=y_train))
# Print performance
print("Weights XGBoost")
print(classification_report(y_test, model_weights.predict(X_test)))
# Get the counts of the training data per XGBoost documentation
counts = Counter(y_train)
model_scale = xgboost.XGBClassifier(scale_pos_weight=counts[0] / counts[1], random_state=30)
model_scale.fit(X_train, y_train)
# Print performance
print("Scale XGBoost")
print(classification_report(y_test, model_scale.predict(X_test)))
# Repeat with the square root of that ratio, as sometimes suggested for extreme imbalance
from math import sqrt
model_sqrt = xgboost.XGBClassifier(scale_pos_weight=sqrt(counts[0] / counts[1]), random_state=30)
model_sqrt.fit(X_train, y_train)
# Print performance
print("SQRT XGBoost")
print(classification_report(y_test, model_sqrt.predict(X_test)))
Results in:
Off the Shelf XGBoost
              precision    recall  f1-score   support

           0       0.95      0.38      0.54        50
           1       0.98      1.00      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.96      0.69      0.77      1303
weighted avg       0.97      0.98      0.97      1303

Weights XGBoost
              precision    recall  f1-score   support

           0       0.95      0.38      0.54        50
           1       0.98      1.00      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.96      0.69      0.77      1303
weighted avg       0.97      0.98      0.97      1303

Scale XGBoost
              precision    recall  f1-score   support

           0       0.73      0.64      0.68        50
           1       0.99      0.99      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.86      0.82      0.83      1303
weighted avg       0.98      0.98      0.98      1303

SQRT XGBoost
              precision    recall  f1-score   support

           0       0.96      0.46      0.62        50
           1       0.98      1.00      0.99      1253

    accuracy                           0.98      1303
   macro avg       0.97      0.73      0.81      1303
weighted avg       0.98      0.98      0.97      1303