When I've used XGBoost for regression in the past, I've gotten differentiated predictions, but using an XGBClassifier on this dataset is resulting in all cases being predicted to have the same value. The true values of the test data are that 221 cases are a 0, and 49 cases are a 1. XGBoost seems to be latching onto that imbalance and predicting all 0's. I'm trying to figure out what I might need to adjust in the model's parameters to fix that.
Here is the code I'm running:
import pyreadstat
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Get data
dfloc = r"C:\Users\me\Desktop\Python practice\GBM_data.sav"
df, meta = pyreadstat.read_sav(dfloc, metadataonly=False)
# Filter data
df = df.dropna(subset=["Q31ar1"])
df = df.query("hgroup2==3")
IVs = ["Q35r1", "Q35r2", "Q35r3", "Q35r4", "Q35r5", "Q35r6", "Q35r7", "Q35r8", "Q35r9", "Q35r10", "Q35r11", "Q35r13", "Q35r14", "Q35r15", "Q35r16"]
# Separate samples
train, test = train_test_split(df, test_size=0.3, random_state=410)
train_features = train[IVs]
train_labels = train["Q31ar1"]
train_weight = train["WeightStack"]
test_features = test[IVs]
test_labels = test["Q31ar1"]
test_weight = test["WeightStack"]
# Set up model & params
model = XGBClassifier(objective = 'binary:logistic',
                      n_estimators = 1000,
                      learning_rate = .005,
                      subsample = .5,
                      max_depth = 4,
                      min_child_weight = 10,
                      tree_method = 'hist',
                      colsample_bytree = .5,
                      random_state = 410)
# Model
model.fit(train_features, train_labels, sample_weight = train_weight)
test_pred = model.predict(test_features)
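For reference, a quick check that confirms the collapse (using the accuracy_score import from above, plus a counter that isn't in my original script) looks something like this; every test prediction lands in class 0:
from collections import Counter
# weighted accuracy looks decent only because the majority class dominates
accuracy = accuracy_score(test_labels, test_pred, sample_weight = test_weight)
print(accuracy)
print(Counter(test_pred))  # every prediction comes back as 0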
Looking through some related questions, it seems like some people have had trouble with their models not going through enough boosting iterations. I'm running through 1000, which has been sufficient for regression in the past. Others were not setting the parameters correctly, but when I run model.get_params(), mine do appear to have been set; here's the output:
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 1,
'colsample_bynode': 1,
'colsample_bytree': 0.5,
'gamma': 0,
'learning_rate': 0.005,
'max_delta_step': 0,
'max_depth': 4,
'min_child_weight': 10,
'missing': None,
'n_estimators': 1000,
'n_jobs': 1,
'nthread': None,
'objective': 'binary:logistic',
'random_state': 410,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.5,
'verbosity': 1,
'tree_method': 'hist'}
Others have had issues with scaling. My predictors are already all on the same scale -- they're ordinal rating scales with values 1 through 5. Still others have had trouble with NaNs, but I'm filtering my data to remove them.
I'm wondering if I might need a different tree method or to mess around with the base_score parameter?
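For example, would something like the following be the right kind of adjustment? scale_pos_weight is XGBoost's built-in way to upweight the positive class, and the negative/positive ratio below is just my guess at a sensible value:
# guess at an adjustment: upweight the rare positive class
n_neg = (train_labels == 0).sum()
n_pos = (train_labels == 1).sum()
model = XGBClassifier(objective = 'binary:logistic',
                      n_estimators = 1000,
                      learning_rate = .005,
                      subsample = .5,
                      max_depth = 4,
                      min_child_weight = 10,
                      tree_method = 'hist',
                      colsample_bytree = .5,
                      scale_pos_weight = n_neg / n_pos,  # roughly 4:1 here
                      random_state = 410)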
EDIT: Per Dan's comments, I tried a few things:
- I stratified my train/test split, and it didn't materially change the test split -- 219 0's and 51 1's. The training sample has 507 0's and 120 1's, so the class proportions are roughly the same in train and test. I recognize this is a small dataset, but I'm a survey researcher, so this is all I've got.
- I tried logistic regression, and I got the same predictions: all 0's. Code:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(train_features, train_labels)
test_pred_log = clf.predict(test_features)
accuracy_log = clf.score(test_features, test_labels)
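I haven't tried re-weighting the classes in the logistic regression; if that's worth doing, I assume it would be sklearn's class_weight option, something like:
# untried variant: let sklearn weight classes inversely to their frequencies
clf_bal = LogisticRegression(class_weight = 'balanced', max_iter = 1000, random_state = 0)
clf_bal.fit(train_features, train_labels)
test_pred_log_bal = clf_bal.predict(test_features)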
- I took a look at predictions on my training data from the XGBoost model, and they're also all 0's, so the ROC curve doesn't reveal much. It was still a good suggestion to look at the training predictions, though; the logistic model's training predictions are the same: all 0's.
from sklearn.metrics import roc_curve
train_pred = model.predict(train_features)
fpr, tpr, thresholds = roc_curve(train_labels, train_pred, pos_label=1)
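I'm guessing the ROC curve would be more informative if I passed the class-1 probabilities rather than the hard 0/1 labels, something like:
# use predicted probabilities so the ROC curve has more than one operating point
train_proba = model.predict_proba(train_features)[:, 1]
fpr, tpr, thresholds = roc_curve(train_labels, train_proba, pos_label=1)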
- I didn't know I could get probability estimates, so thank you for the tip on predict_proba. My probability estimates are differentiated, so that's great! The probabilities for belonging to class 1 are just all lower -- averaging around 20%, which makes sense, since about 20% of the sample is truly in class 1. The problem is that I don't know how to adjust the classification threshold. I suppose I could do it manually using the results from predict_proba (sketched just below), but is there a way to work that into the estimator instead?
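For what it's worth, this is the kind of manual thresholding I mean -- the 0.2 cutoff is just an illustration, not a tuned value:
# manually classify as 1 whenever the class-1 probability clears a chosen cutoff
test_proba = model.predict_proba(test_features)[:, 1]
test_pred_custom = (test_proba >= 0.2).astype(int)  # 0.2 is an arbitrary example threshold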