
I have a question about the XGBoost classifier with the sklearn API. It seems there should be a parameter that controls how much probability is needed for a sample to be classified as True, but I can't find it.

Normally, xgb.predict returns booleans and xgb.predict_proba returns probabilities within the interval [0, 1]. The two results should be related: there must be a probability threshold that decides each sample's class.

import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

dtrain, dtest = train_test_split(data, test_size=0.1, random_state=22)

param_dict={'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 4,
 'min_child_weight': 6,
 'missing': None,
 'n_estimators': 1000,
 'objective': 'binary:logistic',
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1}

xgb = XGBClassifier(**param_dict, n_jobs=2)

xgb.fit(dtrain[features], dtrain['target'])

result_boolean = xgb.predict(dtest[features])
print(np.sum(result_boolean))
Output: 936

result_proba = xgb.predict_proba(dtest[features])
result_boolean2 = (result_proba[:, 1] > 0.5)
print(np.sum(result_boolean2))
Output: 936

It looks like the default probability threshold is 0.5, since both result arrays contain the same number of True values. But I can't find where to adjust it in the code; the signature is just:

predict(data, output_margin=False, ntree_limit=None, validate_features=True)

I have also tested base_score, but it didn't affect the result.

The main reason I want to change the probability threshold is that I want to test XGBClassifier with different probability thresholds via GridSearchCV. xgb.predict_proba doesn't seem to be something I can plug into GridSearchCV directly. How can I change the probability threshold in the XGBClassifier?

劉金喜
  • What exactly is the problem with predict_proba() and GridSearchCV? – Jon Nordby Apr 12 '19 at 19:32
  • Sorry, I find that 'can't be merged into GridSearchCV' is quite misleading. For example, if I write `grid = GridSearchCV(xgb, param_grid, scoring='precision', fit_params=fit_params, cv=4)` and `grid.fit(X=dtrain[features], y=dtrain[target])`, I get the best parameters based on precision at a probability threshold of 0.5. But I want to change the probability threshold to 0.7 or 0.8. – 劉金喜 Apr 13 '19 at 12:10
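
In principle, the threshold can be carried into GridSearchCV with a custom scorer; a minimal sketch under that assumption, reusing the xgb and param_grid from the comment above (precision_at_threshold is a hypothetical helper and 0.7 is just an example value):

from sklearn.metrics import make_scorer, precision_score
from sklearn.model_selection import GridSearchCV

def precision_at_threshold(y_true, proba, threshold=0.7):
    # Hypothetical helper: precision after applying a custom probability threshold.
    return precision_score(y_true, (proba >= threshold).astype(int))

# needs_proba=True makes GridSearchCV hand the scorer the positive-class
# probabilities from predict_proba instead of 0.5-thresholded labels.
scorer = make_scorer(precision_at_threshold, needs_proba=True, threshold=0.7)

grid = GridSearchCV(xgb, param_grid, scoring=scorer, cv=4)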

2 Answers


When you use ROC AUC (ROC = Receiver Operating Characteristic, AUC = Area Under the Curve) as the scoring function, the grid search is done with predict_proba(). The chosen classifier hyperparameters will be the ones with the best overall performance across all possible decision thresholds.

GridSearchCV(scoring='roc_auc', ...)
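
A minimal sketch of what that might look like with the question's data (the search space here is purely illustrative):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative search space, not a recommendation.
param_grid = {'max_depth': [3, 4, 5], 'min_child_weight': [1, 6]}

grid = GridSearchCV(XGBClassifier(objective='binary:logistic'),
                    param_grid,
                    scoring='roc_auc',  # scored via predict_proba, threshold-free
                    cv=4)
grid.fit(dtrain[features], dtrain['target'])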

Then you can plot the ROC curve in order to determine the decision threshold that gives you the desired balance of precision vs. recall / true-positive vs. false-negative.
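
A sketch of that step, assuming grid is the fitted search from above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Probabilities for the positive class on the held-out set.
proba = grid.best_estimator_.predict_proba(dtest[features])[:, 1]
fpr, tpr, thresholds = roc_curve(dtest['target'], proba)

plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (recall)')
plt.show()

# One common heuristic: pick the threshold closest to the top-left corner.
best_threshold = thresholds[np.argmin(fpr**2 + (1 - tpr)**2)]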


More info in the scikit-learn documentation on ROC.

Jon Nordby
  • Thanks. I think ROC AUC can be useful in my case. But is it possible to change the decision threshold of XGBClassifier, so I don't need to use `predict_proba` and then set the decision threshold myself? – 劉金喜 Apr 15 '19 at 16:01
  • Is it just me, or does this not answer the question? – PJ_ Sep 14 '22 at 14:57
  • This doesn't answer the question! – A.B. Mar 26 '23 at 19:24

I think you should look at the source code to understand. I had trouble finding it for xgboost, but I found how it works in lightgbm, and I guess that xgboost works similarly.

Go here (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.predict) and look at the method "predict":

def predict(self, X, raw_score=False, num_iteration=None,
            pred_leaf=False, pred_contrib=False, **kwargs):
    """Docstring is inherited from the LGBMModel."""
    result = self.predict_proba(X, raw_score, num_iteration,
                                pred_leaf, pred_contrib, **kwargs)
    if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
        return result
    else:
        class_index = np.argmax(result, axis=1)
        return self._le.inverse_transform(class_index)


predict.__doc__ = LGBMModel.predict.__doc__

Practically, the classifier is always trained as a multi-class classifier, and predict chooses the class that has the highest probability. The output of predict_proba is:

predicted_probability (array-like of shape = [n_samples, n_classes]) – The predicted probability for each class for each sample.

And you see that the method says:

class_index = np.argmax(result, axis=1)

Where "result" is the output of predict_proba. Now, if you have predict_proba for many classes do they sum to 1? I guess so, but I suppose we should go into the classifier loss function to really understand what is going on...

This is what I would read next: http://wiki.fast.ai/index.php/Log_Loss
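
For reference, a minimal sketch of binary log loss, the quantity that binary:logistic optimizes:

import numpy as np

def binary_log_loss(y_true, p):
    # Mean of -[y*log(p) + (1 - y)*log(1 - p)] over the samples.
    p = np.clip(p, 1e-15, 1 - 1e-15)  # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0])
print(binary_log_loss(y, np.array([0.9, 0.1])))  # confident and right: ~0.105
print(binary_log_loss(y, np.array([0.1, 0.9])))  # confident and wrong: ~2.303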

spec3