
When I've used XGBoost for regression in the past, I've gotten differentiated predictions, but using an XGBClassifier on this dataset results in every case being predicted as the same class. In the test data, the true values are 221 cases of 0 and 49 cases of 1. XGBoost seems to be latching onto that imbalance and predicting all 0's. I'm trying to figure out what I might need to adjust in the model's parameters to fix that.

Here is the code I'm running:

import pyreadstat
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Get data
dfloc = r"C:\Users\me\Desktop\Python practice\GBM_data.sav"
df, meta = pyreadstat.read_sav(dfloc, metadataonly=False)

# Filter data
df = df.dropna(subset=["Q31ar1"])
df = df.query("hgroup2==3")
IVs = ["Q35r1", "Q35r2", "Q35r3", "Q35r4", "Q35r5", "Q35r6", "Q35r7", "Q35r8", "Q35r9", "Q35r10", "Q35r11", "Q35r13", "Q35r14", "Q35r15", "Q35r16"]

# Separate samples
train, test = train_test_split(df, test_size=0.3, random_state=410)

train_features = train[IVs]
train_labels = train["Q31ar1"]
train_weight = train["WeightStack"]

test_features = test[IVs]
test_labels = test["Q31ar1"]
test_weight = test["WeightStack"]

# Set up model & params
model = XGBClassifier(objective = 'binary:logistic',
                     n_estimators = 1000,
                     learning_rate = .005,
                     subsample = .5,
                     max_depth = 4,
                     min_child_weight = 10,
                     tree_method = 'hist',
                     colsample_bytree = .5,
                     random_state = 410)

# Model
model.fit(train_features, train_labels, sample_weight = train_weight)
test_pred = model.predict(test_features)

Looking through some related questions, it seems like some people have had trouble with their models not going through enough boosting iterations. I'm running through 1000, which has been sufficient for regression in the past. Others were not setting the parameters correctly, but when I run model.get_params(), mine do appear to have been set; here's the output:

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 0.5,
 'gamma': 0,
 'learning_rate': 0.005,
 'max_delta_step': 0,
 'max_depth': 4,
 'min_child_weight': 10,
 'missing': None,
 'n_estimators': 1000,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 410,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 0.5,
 'verbosity': 1,
 'tree_method': 'hist'}

Others have had issues with scaling. My predictors are already all on the same scale -- they're ordinal rating scales with values 1, 2, 3, 4, and 5. Still others have had trouble with NaNs, but I'm filtering my data to remove NaNs.

I'm wondering if I might need a different tree method, or if I should adjust the base_score parameter.

EDIT: Per Dan's comments, I tried a few things:

  1. I stratified my train/test split, and it didn't materially change the test distribution -- 219 0's and 51 1's. The training sample has 507 0's and 120 1's, so its class distribution roughly matches the test set's. I recognize this is a small dataset, but I'm a survey researcher, so this is all I've got.
  2. I tried logistic regression, and I got the same predictions: all 0's. Code:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0).fit(train_features, train_labels)
test_pred_log = clf.predict(test_features)
accuracy_log = clf.score(test_features, test_labels)
  3. I took a look at predictions on my training data from the XGBoost model, and they're also all 0's, so the ROC curve doesn't reveal much, but it was a good suggestion to look at the training predictions. The logistic model has the same training predictions: all 0.
from sklearn.metrics import roc_curve
train_pred = model.predict(train_features)
fpr, tpr, thresholds = roc_curve(train_labels, train_pred, pos_label=1)
  4. I didn't know I could get probability estimates, so thank you for the tip on predict_proba. My probability estimates are differentiated, so that's great! The probabilities for belonging to class 1 are just all lower -- averaging around 20%, which makes sense, since about 20% of the sample is truly in class 1. The problem is that I don't know how to adjust the threshold on the predictions. I suppose I could do it manually using the results from predict_proba (roughly sketched below), but is there a way to work that into the estimator instead?
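
Here's a minimal sketch of what the manual version might look like, using the variables from the code above; the 0.2 cutoff is just an illustrative value picked to roughly match the base rate, not something the model provides:

# Sketch: apply a custom cutoff to the predicted probabilities instead of model.predict()
test_proba = model.predict_proba(test_features)[:, 1]   # probability of class 1
test_pred_custom = (test_proba >= 0.2).astype(int)      # 0.2 is an illustrative threshold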
  • Obvious question first, you've mentioned the imbalance in your test set but do you have any positive cases in your _train_ dataset? You should consider using the `stratify` argument when you call `train_test_split`. Also I would suggest you get your pipeline working on a simple model (try logistic regression or knn) first. This will also give you a baseline score to compare against. 1000 estimators seems pretty high for the size of your training set, how many training examples do you have? – Dan Aug 12 '20 at 20:57
  • Also you haven't checked your training accuracy (or logloss or roc). Maybe also try it first without the weights. And check the output for `predict_proba` if you're concerned about the threshold, maybe you need to adjust that to be much lower. Plotting an ROC could help. – Dan Aug 12 '20 at 21:04
  • Thanks, Dan! I've edited my post to reflect some of the suggestions you made. I think ultimately the threshold is probably the issue, since the probabilities are differentiated, but they're all on the lower side given the unbalanced nature of the data. Is there a way to adjust that threshold in the model, rather than manually predicting class membership via the membership probabilities? – Laura Aug 13 '20 at 13:56
  • btw the ROC curve needs the predicted probabilities, not the labels -- it works by varying the threshold. When it comes to picking a threshold, this is a great visualization to use: https://www.scikit-yb.org/en/latest/api/classifier/threshold.html. Bear in mind that choosing the threshold is more of a business decision, i.e. you need to know the cost of FPs and FNs and any minimum acceptable levels etc in order to choose that trade-off – Dan Aug 13 '20 at 15:08
  • Regarding changing the threshold, I don't know if you can, but you never need to if you have the `predict_proba` method -- it's just a case of doing `model.predict_proba(X) >= threshold` – Dan Aug 13 '20 at 15:12
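
Putting those last two comments into code, a corrected version of the ROC snippet from the edit above might look like this -- a sketch only, feeding the class-1 probabilities rather than the hard 0/1 labels into roc_curve:

from sklearn.metrics import roc_curve

# roc_curve varies the threshold itself, so it needs scores/probabilities, not class labels
train_proba = model.predict_proba(train_features)[:, 1]
fpr, tpr, thresholds = roc_curve(train_labels, train_proba, pos_label=1)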

1 Answer


Found an answer on the stats section: https://stats.stackexchange.com/questions/243207/what-is-the-proper-usage-of-scale-pos-weight-in-xgboost-for-imbalanced-datasets

scale_pos_weight is a parameter you can adjust to deal with class imbalance like this. Mine was set to the default, 1, which means positive (1) and negative (0) cases are weighted equally. If I change it to 4, which is roughly my ratio of negatives to positives, I start seeing cases predicted as 1.
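
As a concrete sketch, assuming the same setup as the code in the question, the only change is computing the ratio from the training labels and passing it in:

# Sketch: derive scale_pos_weight from the training labels (same model setup as above)
neg = (train_labels == 0).sum()
pos = (train_labels == 1).sum()

model = XGBClassifier(objective = 'binary:logistic',
                     n_estimators = 1000,
                     learning_rate = .005,
                     subsample = .5,
                     max_depth = 4,
                     min_child_weight = 10,
                     tree_method = 'hist',
                     colsample_bytree = .5,
                     scale_pos_weight = neg / pos,   # ~4 here; (neg / pos) ** 0.5 for the more conservative option
                     random_state = 410)
model.fit(train_features, train_labels, sample_weight = train_weight)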

My accuracy score goes down, but that makes sense: with this data you can get a higher % accuracy just by predicting everyone to be 0, since the vast majority of cases are 0. But I'm running this model not for accuracy but for information on the importances/contributions of each predictor, so I want differentiated predictions.

One answer in the link also suggested being more conservative by setting scale_pos_weight to the sqrt of the ratio, which would be 2 in this case. I got a higher accuracy with 2 than with 4, so that's what I'm going with, and I plan to look into this parameter in future classification models.
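
Since the point of the model is the per-predictor importances rather than the accuracy itself, here's a quick sketch of pulling those from the fitted classifier (using the IVs list from the question; pandas is just for readable output):

import pandas as pd

# Sketch: per-predictor importances from the fitted XGBClassifier
importances = pd.Series(model.feature_importances_, index=IVs)
print(importances.sort_values(ascending=False))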

For a multi-class model, it looks like you're better off adjusting the case-level weights to bring your classes to even representation, as outlined here: https://datascience.stackexchange.com/questions/16342/unbalanced-multiclass-data-with-xgboost
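
A sketch of what that case-level reweighting could look like with scikit-learn's helper (compute_sample_weight is a real utility; combining it with the survey weights here is just an illustration):

from sklearn.utils.class_weight import compute_sample_weight

# Sketch: per-row weights that bring each class to even representation
balance_weight = compute_sample_weight(class_weight='balanced', y=train_labels)
# could be multiplied with existing survey weights, e.g. train_weight * balance_weight
model.fit(train_features, train_labels, sample_weight=balance_weight)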
