Scikit-learn has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. The documentation also states clearly that data for fitting the classifier and for calibrating it must be disjoint.
If they must be disjoint, is it legitimate to train the classifier with the following?
from sklearn.calibration import CalibratedClassifierCV

model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)
I fear that by using the same training set I'm breaking the disjoint-data rule. An alternative might be to use a separate validation set:
# fit the classifier on the training set, then calibrate on held-out data
my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)
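where X_valid, y_valid would come from a held-out split along these lines (the 80/20 split here is an arbitrary choice of mine, just a sketch):

from sklearn.model_selection import train_test_split

# hold back 20% of the data purely for calibration
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)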
This has the disadvantage of leaving less data for training. Also, if CalibratedClassifierCV should only be fit on data disjoint from the base estimator's training set, why is its default cv=3, which will also fit the base estimator? Does the cross-validation handle the disjoint rule on its own?
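For concreteness, here is a minimal sketch of what I understand the default behaviour to be (the toy dataset and base classifier are placeholders I picked for illustration):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# With an integer cv, fit() splits the data internally: for each fold, a
# clone of the base estimator is trained on the remaining folds and then
# calibrated on the held-out fold, so the two sets never overlap.
model = CalibratedClassifierCV(RandomForestClassifier(), cv=3)
model.fit(X, y)

print(len(model.calibrated_classifiers_))  # one calibrated classifier per fold: 3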
Question: what is the correct way to use CalibratedClassifierCV?