
I'm studying the effects of calibrating a classifier, and I read that the aim of calibration is to make a classifier's predictions more 'reliable'. With this in mind, I assumed that a calibrated classifier would have a higher score (roc_auc).

When I tested this hypothesis in Python with sklearn, I found the exact opposite.

Could you please explain:

Does calibration improve the ROC score (or any other metric)?

If not, what is/are the advantage(s) of performing calibration?

from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# X_train, X_test, y_train, y_test are assumed to be already defined
clf = SVC(probability=True).fit(X_train, y_train)
calibrated = CalibratedClassifierCV(clf, cv=5, method='sigmoid').fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
cal_probs = calibrated.predict_proba(X_test)[:, 1]

plt.figure(figsize=(12, 7))
names = ['non-calibrated SVM', 'calibrated SVM']
for i, p in enumerate([probs, cal_probs]):
    plt.subplot(1, 2, i + 1)
    fpr, tpr, thresholds = roc_curve(y_test, p)
    plt.plot(fpr, tpr, label=names[i], marker='o')  # was label=nombre[i], which is undefined
    plt.title(names[i] + '\n' + 'ROC: ' + str(round(roc_auc_score(y_test, p), 4)))
    plt.plot([0, 1], [0, 1], color='red', linestyle='--')
    plt.grid()
    plt.tight_layout()
    plt.xlim([0, 1])
    plt.ylim([0, 1])
plt.show()

[Image: side-by-side ROC curves for the non-calibrated and calibrated SVM]

  • @AI_Learning could you please help me to clarify this question. I would really appreciate it – Moreno Jan 23 '19 at 05:05
  • What does 'reliable' mean in your question? As far as I can tell calibration should not change the ranks of the predictions, only their absolute values. Hence the ROC curves should be exactly the same and what you're seeing is an artifact of different training procedures. – Calimo Jan 23 '19 at 07:14

1 Answer


TLDR: Calibration should not affect ROCAUC.

Longer answer:

ROCAUC is a measure of rank ("did we put these observations in the best possible order?"). However, it does not ensure good probabilities.

Example: If I'm classifying how likely someone is to have cancer, I may always say a number between 95% and 99%, and still have perfect ROCAUC, as long as I've made my predictions in the right order (the 99%s had cancer, the 95%s did not).

Here we would say that this classifier (the one that says 95% when they are unlikely to have cancer) has a good ability to rank, but is badly calibrated.
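To make that concrete, here is a tiny made-up version of the cancer example (the numbers are invented purely for illustration): the ranking is perfect, so the ROCAUC is 1.0, but the probabilities are far off, so the Brier score is terrible.

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([0, 0, 0, 1, 1])                  # only the last two actually have cancer
y_prob = np.array([0.95, 0.95, 0.96, 0.98, 0.99])   # always "somewhere between 95% and 99%"

print(roc_auc_score(y_true, y_prob))     # 1.0   -> perfect ranking
print(brier_score_loss(y_true, y_prob))  # ~0.55 -> poor calibration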

So what can we do? We can apply a monotonic transformation that fixes the probabilities without changing the ranking (and therefore without changing the ROCAUC).

Example: in our cancer example, we could say that when a prediction is under 97.5% it should be decreased by 90%, and when it is over 97.5% it is kept as is. This really crass approach will not affect the ROC, but it will send the "lowest" predictions close to 0, improving our calibration as measured by the Brier score.
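A rough numeric sketch of that crass rule, reusing the invented numbers from the previous snippet: the transformation is monotonic, so only the Brier score moves.

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([0.95, 0.95, 0.96, 0.98, 0.99])

# decrease predictions under 0.975 by 90% (i.e. keep 10% of them), keep the rest
crass = np.where(y_prob < 0.975, y_prob * 0.1, y_prob)

print(roc_auc_score(y_true, y_prob), roc_auc_score(y_true, crass))        # 1.0 and 1.0
print(brier_score_loss(y_true, y_prob), brier_score_loss(y_true, crass))  # ~0.55 vs ~0.006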

Great, now we can get clever! What is the "best" monotonic curve for improving our Brier score? Well, we can let Python deal with this by using scikit-learn's calibration (CalibratedClassifierCV), which essentially finds that curve for us. Again, it will improve the calibration, but not change the ROCAUC, since the rank order is maintained.
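For completeness, here is a minimal sketch of that comparison; make_classification and the train/test split are just stand-ins for your own data, and the exact numbers will depend on it. It prints the ROCAUC and the Brier score for the raw and the calibrated model so you can see which one actually moves.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score, brier_score_loss

# synthetic stand-in for your data
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(probability=True).fit(X_train, y_train)
# a fresh SVC here, so the cross-validated calibration fits its own copies
calibrated = CalibratedClassifierCV(SVC(probability=True), cv=5, method='sigmoid').fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]
cal_probs = calibrated.predict_proba(X_test)[:, 1]

# the ROCAUCs should be very close; the Brier score is where calibration shows up
print('ROC AUC raw / calibrated:', roc_auc_score(y_test, probs), roc_auc_score(y_test, cal_probs))
print('Brier   raw / calibrated:', brier_score_loss(y_test, probs), brier_score_loss(y_test, cal_probs))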

Great, so the ROCAUC does not move.

And yet...
To quote Galileo after being forced to recant that the Earth moves around the Sun: "E pur si muove" (and yet it moves).

Ok. Now things get funky. In order to do the monotonic transformation, some observations which were close (e.g. 25% and 25.5%) may get "squished" together (e.g. to 0.7% and 0.700000001%). These may then be rounded, causing the predictions to become tied. And then, when we calculate the ROCAUC... it will have moved.
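A contrived illustration of that effect, with invented numbers: after the "squish" and a round-off, two predictions become tied and the ROCAUC shifts slightly.

import numpy as np
from sklearn.metrics import roc_auc_score

y_true   = np.array([0, 1, 0, 1])
original = np.array([0.250, 0.255, 0.40, 0.90])              # 0.255 (a positive) still outranks 0.250
squished = np.round([0.007, 0.00700000001, 0.05, 0.60], 3)   # rounding makes the first two tie

print(roc_auc_score(y_true, original))   # 0.75
print(roc_auc_score(y_true, squished))   # 0.625 -- the tie only counts as half a correct ordering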

However, for all practical purposes, you can expect that the "real" ROCAUC is not affected by calibration; calibration should only change the quality of the probability estimates themselves, as measured by the Brier score.

sapo_cosmico