
In the documentation of CalibratedClassifierCV it is stated:

The samples that are used to train the calibrator should not be used to train the target classifier.

As far as I understand, this means that if we use X_train to train our model, then we should not use X_train to train the calibrator (since, I assume, it would just map X_train to the model's predict_proba output?).

But would it be fair to use the validation set X_val, which has been used for hyper-parameter optimization of the model, to fit the calibrator?

asked by CutePoison

1 Answer


CalibratedClassifierCV fits a classifier on one sample of the data and then checks, on a different dataset, whether the predicted probabilities correspond well to the real labels. It then tries to adjust ("calibrate") the probabilities so that, for a group of samples with a predicted probability of e.g. 0.7, approximately 70% of them carry the true label "1" (for binary classification).
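To make that idea concrete, here is a minimal sketch (toy, made-up numbers, not part of the experiment below) of how calibration_curve measures the agreement between predicted probabilities and the observed fraction of positives:

# Toy illustration: for each probability bin, calibration_curve returns the
# observed fraction of positives and the mean predicted probability; a
# well-calibrated model has the two close to each other.
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 1, 1, 1, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.7, 0.7, 0.7, 0.3, 0.8, 0.9, 0.1, 0.65, 0.75])

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=3)
print(mean_pred)   # mean predicted probability per bin
print(frac_pos)    # observed fraction of positives per bin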

Note that training the classifier and fitting the calibrator are done on different datasets to account for a possible bias on unseen data. This can be done in two different ways (a minimal sketch follows the list):

  1. CalibratedClassifierCV(..., cv=None, ensemble=True). Your training data is split according to the supplied cv strategy, say cv=5. As usual, the base estimator (the classifier to be calibrated) is trained on k-1 folds, and the calibrator is then fitted on the k-th, left-out fold. The resulting predictions at test time are the average over the k fitted calibrators (the "ensemble").

  2. If ensemble=False, out-of-fold predictions obtained with cross_val_predict are used to fit a single calibrator, and the classifier itself is refit on the whole training set.
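A minimal sketch of the two modes on synthetic data (the estimator and numbers here are illustrative only):

# Illustrative comparison of ensemble=True vs ensemble=False.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=2000, random_state=0)

# ensemble=True: 5 (classifier, calibrator) pairs, test predictions averaged
cal_ens = CalibratedClassifierCV(LogisticRegression(), cv=5, ensemble=True).fit(X, y)
print(len(cal_ens.calibrated_classifiers_))   # 5

# ensemble=False: one calibrator fitted on cross_val_predict output,
# classifier refit on the full training data
cal_one = CalibratedClassifierCV(LogisticRegression(), cv=5, ensemble=False).fit(X, y)
print(len(cal_one.calibrated_classifiers_))   # 1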

With this in mind:

would it be fair to use the validation set X_val, which has been used for hyper-parameter optimization of the model, to fit the calibrator?

You find the best hyperparams, and then, yes, you can use the same training dataset, but with the chosen cv strategy (ensemble=True or False), to fit the calibrator. Note again: you never fit the classifier and the calibrator on the same data.
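A rough sketch of that workflow (hypothetical estimator and grid, chosen only to show the order of operations):

# Hypothetical workflow: tune on the training data, then calibrate on the
# same training data via internal CV, so that within each split the
# classifier and the calibrator still see different folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) hyper-parameter optimization (any CV search would do)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"max_depth": [3, 5]}, cv=5)
search.fit(X_train, y_train)

# 2) calibrate a fresh classifier with the chosen params on the same X_train
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0, **search.best_params_),
    cv=5, ensemble=True,
).fit(X_train, y_train)

probs = calibrated.predict_proba(X_test)[:, 1]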

It's interesting to see that, depending on the classifier and the size of the data, calibration may or may not help you increase the accuracy of the probability predictions:

# Imports: numpy, pandas, matplotlib, make_classification and log_loss are
# added here so that the snippet is self-contained.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

np.random.seed(42)

# Create classifiers
lrc = LogisticRegression(n_jobs=-1)
gnb = GaussianNB()
svc = SVC(C=1.0, probability=True)
rfc = RandomForestClassifier(n_estimators=300, max_depth=3, n_jobs=-1)
xgb = XGBClassifier(
    n_estimators=300,
    max_depth=3,
    objective="binary:logistic",
    eval_metric="logloss",
    use_label_encoder=False,
)
lgb = LGBMClassifier(n_estimators=300, objective="binary", max_depth=3)
cat = CatBoostClassifier(n_estimators=300, max_depth=3, objective="Logloss", verbose=0)

df = pd.DataFrame()

plt.figure(figsize=(10, 10))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))
ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")

for clf, name in [
    (lrc, "Logistic"),
    (gnb, "Naive Bayes"),
#     (svc, "Support Vector Classification"),
    (rfc, "Random Forest"),
    (lgb, "Light GBM"),
    (xgb, "Xgboost"),
    (cat, "Catboost"),
]:
    print(name)
    for nsamples in [1000, 10000, 100000]:
        # synthetic binary classification data, 75/25 train/test split
        train_samples = 0.75
        X, y = make_classification(
            n_samples=nsamples, n_features=20, n_informative=2, n_redundant=2
        )
        i = int(train_samples * nsamples)
        X_train = X[:i]
        X_test  = X[i:]
        y_train = y[:i]
        y_test  = y[i:]
        clf.fit(X_train, y_train)
        prob_pos = clf.predict_proba(X_test)[:, 1]
        fraction_of_positives, mean_predicted_value = calibration_curve(
            y_test, prob_pos, n_bins=10
        )

        # plot the reliability curve and histogram only for the 10k-sample runs
        if nsamples == 10000:
            ax1.plot(
                mean_predicted_value,
                fraction_of_positives,
                "s-",
                label="%s" % (name + " nsamples " + str(nsamples),),
            )

            ax2.hist(
                prob_pos,
                bins=10,
                label="%s" % (name + " nsamples " + str(nsamples),),
                histtype="step",
                lw=2,
            )

        # log loss of the raw probabilities vs the probabilities after
        # calibrating the same classifier with 5-fold CV on the train data
        preds = clf.predict_proba(X_test)
        ll_before = log_loss(y_test, preds)
        preds = (
            CalibratedClassifierCV(clf, cv=5)
            .fit(X_train, y_train)
            .predict_proba(X_test)
        )
        ll_after = log_loss(y_test, preds)
        df = pd.concat([df, pd.DataFrame({
            "Samples": [nsamples],
            "Model": name,
            "LogLoss Before": round(ll_before, 4),
            "LogLoss After": round(ll_after, 4),
            "Gain": round(ll_before / ll_after, 4),
        })])

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title("Calibration plots  (reliability curve)")

ax2.set_xlabel("Mean predicted value")
ax2.set_ylabel("Count")
ax2.legend(loc="upper center", ncol=2)

plt.tight_layout()
plt.show()

print(df)

   Samples          Model  LogLoss Before  LogLoss After    Gain
0     1000       Logistic          0.3941         0.3854  1.0226
0    10000       Logistic          0.1340         0.1345  0.9959
0   100000       Logistic          0.1645         0.1645  0.9999
0     1000    Naive Bayes          0.3025         0.2291  1.3206
0    10000    Naive Bayes          0.4094         0.3055  1.3403
0   100000    Naive Bayes          0.4119         0.2594  1.5881
0     1000  Random Forest          0.4438         0.3137  1.4146
0    10000  Random Forest          0.3450         0.2776  1.2427
0   100000  Random Forest          0.3104         0.1642  1.8902
0     1000      Light GBM          0.2993         0.2219  1.3490
0    10000      Light GBM          0.2074         0.2182  0.9507
0   100000      Light GBM          0.2397         0.2534  0.9459
0     1000        Xgboost          0.1870         0.1638  1.1414
0    10000        Xgboost          0.3072         0.2967  1.0351
0   100000        Xgboost          0.1136         0.1186  0.9575
0     1000       Catboost          0.1834         0.1901  0.9649
0    10000       Catboost          0.1251         0.1377  0.9085
0   100000       Catboost          0.1600         0.1727  0.9264

[Figure: calibration plots (reliability curves) and histograms of predicted probabilities for the models above]

answered by Sergey Bushmanov