
I'm wondering whether I can do probability calibration with xgboost. To be more specific, does xgboost come with a calibration implementation like scikit-learn's, or is there a way to plug a model from xgboost into scikit-learn's CalibratedClassifierCV?

As far as I know, this is the common procedure in sklearn:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

# Train a random forest classifier, calibrate on validation data and
# evaluate on test data
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)

# Sigmoid (Platt) calibration on the held-out validation set
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)
sig_score = log_loss(y_test, sig_clf_probs)
print("Calibrated score is", sig_score)

If I put an xgboost tree model into CalibratedClassifierCV, an error is thrown (of course):

RuntimeError: classifier has no decision_function or predict_proba method.

Is there a way to integrate the excellent calibration module of scikit-learn with xgboost?

Appreciate your insightful ideas!

OrlandoL

2 Answers


Answering my own question: an xgboost GBT can be integrated with scikit-learn by writing a wrapper class like the one below.

import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss


class XGBoostClassifier():
    def __init__(self, num_boost_round=10, **params):
        self.clf = None
        self.num_boost_round = num_boost_round
        self.params = params
        self.params.update({'objective': 'multi:softprob'})

    def fit(self, X, y, num_boost_round=None):
        num_boost_round = num_boost_round or self.num_boost_round
        # Map the original labels to consecutive integers for xgboost
        self.label2num = dict((label, i) for i, label in enumerate(sorted(set(y))))
        # multi:softprob requires num_class; infer it from the labels if not given
        self.params.setdefault('num_class', len(self.label2num))
        dtrain = xgb.DMatrix(X, label=[self.label2num[label] for label in y])
        self.clf = xgb.train(params=self.params, dtrain=dtrain,
                             num_boost_round=num_boost_round)

    def predict(self, X):
        num2label = dict((i, label) for label, i in self.label2num.items())
        Y = self.predict_proba(X)
        y = np.argmax(Y, axis=1)
        return np.array([num2label[i] for i in y])

    def predict_proba(self, X):
        dtest = xgb.DMatrix(X)
        return self.clf.predict(dtest)

    def score(self, X, y):
        # Inverse multiclass log loss (higher is better); sklearn's log_loss
        # plays the role of the local logloss helper used in the original
        Y = self.predict_proba(X)
        return 1 / log_loss(y, Y)

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        if 'num_boost_round' in params:
            self.num_boost_round = params.pop('num_boost_round')
        if 'objective' in params:
            del params['objective']
        self.params.update(params)
        return self

See full example here.
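
For completeness, here is a minimal sketch of how the wrapper could then be dropped into CalibratedClassifierCV, mirroring the prefit flow from the question. The data splits and booster parameters are placeholders, and depending on your scikit-learn version the prefit estimator may also need to expose a classes_ attribute:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# Hypothetical booster parameters -- placeholders, not tuned values
clf = XGBoostClassifier(num_boost_round=100, eta=0.1, max_depth=6)
clf.fit(X_train, y_train)

# Sigmoid (Platt) calibration of the already-fitted wrapper on validation data
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)
print("Calibrated score is", log_loss(y_test, sig_clf_probs))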

Please don't hesitate to provide a smarter way of doing this!

OrlandoL
Nice job. I've found that additional calibration of techniques where log loss is directly optimized (like xgboost) doesn't yield as much. Random forests and SVMs are known culprits of being highly discriminative classifiers, but because they optimize different things, they can benefit from some calibration. – T. Scharf Feb 24 '16 at 14:23

A note from the hellscape that is July 2020:

You no longer need a wrapper class. The predict_proba method is built into the xgboost scikit-learn Python API. I'm not sure when it was added, but it's there from v1.0.0 onward for certain.

Note: this is of course only true for classes that would have a predict_proba method. For example, XGBRegressor doesn't; XGBClassifier does.
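
As a minimal sketch (assuming xgboost >= 1.0, a scikit-learn version where cv="prefit" is still supported, and the same placeholder data splits as in the question), the sklearn-style classifier can be passed straight in with no wrapper:

from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV

# XGBClassifier exposes predict_proba (and classes_), so no wrapper is needed
clf = XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Calibrate the prefit model on held-out validation data, as in the question
calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
calibrated.fit(X_valid, y_valid)
calibrated_probs = calibrated.predict_proba(X_test)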

Robert Beatty