Scikit-learn has CalibratedClassifierCV, which allows us to calibrate our models on a particular X, y pair. The documentation also states clearly that data for fitting the classifier and for calibrating it must be disjoint.
If they must be disjoint, is it legitimate to train the classifier with the following?
from sklearn.calibration import CalibratedClassifierCV

model = CalibratedClassifierCV(my_classifier)
model.fit(X_train, y_train)
I fear that by using the same training set I'm breaking the disjoint-data rule. An alternative might be to use a separate validation set:
# fit the classifier on the training set, then calibrate on held-out data
my_classifier.fit(X_train, y_train)
model = CalibratedClassifierCV(my_classifier, cv='prefit')
model.fit(X_valid, y_valid)
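where X_valid, y_valid would come from a held-out split along these lines (the 80/20 split here is an arbitrary choice of mine, just a sketch):

from sklearn.model_selection import train_test_split

# hold back 20% of the data purely for calibration
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)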
This has the disadvantage of leaving less data for training. Also, if CalibratedClassifierCV should only be fit on data disjoint from the base estimator's training set, why is its default cv=3, which will also fit the base estimator? Does the cross-validation handle the disjoint rule on its own?
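For concreteness, here is a minimal sketch of what I understand the default behaviour to be (the toy dataset and base classifier are placeholders I picked for illustration):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# With an integer cv, fit() splits the data internally: for each fold, a
# clone of the base estimator is trained on the remaining folds and then
# calibrated on the held-out fold, so the two sets never overlap.
model = CalibratedClassifierCV(RandomForestClassifier(), cv=3)
model.fit(X, y)

print(len(model.calibrated_classifiers_))  # one calibrated classifier per fold: 3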
Question: what is the correct way to use CalibratedClassifierCV?