
I have the following code:

from sklearn.metrics import roc_curve, auc

actual      = [1,1,1,0,0,1]
prediction_scores = [0.9,0.9,0.9,0.1,0.1,0.1]
false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# 0.875

In this example the interpretation of prediction_scores is straightforward: the higher the score, the more confident the prediction.

Now I have another set of prediction scores. They are not fractional, and the interpretation is reversed: the lower the score, the more confident the prediction.

prediction_scores_v2 = [10.3,10.3,10.2,10.5,2000.34,2000.34]
# meant to be equivalent to prediction_scores above, just inverted (lower = more confident)

My question is: how can I scale the values in prediction_scores_v2 so that it gives an AUC score similar to the first one?

To put it another way, scikit-learn's roc_curve requires y_score to be probability estimates of the positive class. How should I treat the values if the y_score I have is effectively a probability estimate of the wrong class?

– neversaint
  • I'm not sure what you're asking. What do your new prediction scores represent? – BrenBarn May 13 '16 at 06:29
  • @BrenBarn: You can see it as the 'inverse' of confidence. – neversaint May 13 '16 at 06:30
  • In what sense? You generate the AUC from specific information, namely the false positive rate and true positive rate for various discrimination thresholds. You can't just take some arbitrary numbers and calculate an AUC from that. You need to explain what those numbers represent, statistically/mathematically speaking. – BrenBarn May 13 '16 at 06:32
  • @BrenBarn: I want to use Scikit-Learn ROC to measure the performance of a prediction tool. They have their own formula to calculate that score, but the tool gives values as I stated in V2. The interpretation of those values is as I said: the lower, the better. – neversaint May 13 '16 at 06:36
  • "The lower the better" is not specific enough. You need to know how to interpret the actual numbers. What is the difference between 5 and 10? What about between 5 and 6? What makes you think you can use those values to calculate the AUC at all? – BrenBarn May 13 '16 at 07:01

2 Answers


For AUC, you really only care about the order of your predictions. So as long as that order is preserved, you can just get your predictions into a format that AUC will accept.

You'll want to divide by the max to get your predictions to be between 0 and 1, and then subtract from 1 since lower is better in your case:

# rescale into [0, 1], then flip so that higher now means more confident
max_pred = max(prediction_scores_v2)
prediction_scores_v2[:] = (1-x/max_pred for x in prediction_scores_v2)

false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, prediction_scores_v2, pos_label=1)
roc_auc = auc(false_positive_rate, true_positive_rate)
roc_auc
# 0.8125
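
Since AUC only depends on that ordering, any order-reversing transform gives the same result. As an alternative sketch (assuming the raw prediction_scores_v2 from the question, i.e. before the in-place rescaling above), simply negating the scores works too:

negated_scores = [-x for x in prediction_scores_v2]  # negation reverses the ranking, which is all AUC needs
fpr, tpr, _ = roc_curve(actual, negated_scores, pos_label=1)
auc(fpr, tpr)
# should again be 0.8125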
– Tchotchke

How can I treat the value if the y_score I have is probability estimates of the wrong class?

This is a really cheap shot, but have you considered reversing the original class list, as in

actual      = [abs(x-1) for x in actual]  # flips each label: 1 -> 0, 0 -> 1

Then, you could still apply the normalization @Tchotchke proposed.
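
A minimal sketch of that combination (assuming the raw actual and prediction_scores_v2 from the question, and the roc_curve/auc imports above): flip the labels, and only divide by the max for the scaling, since the label flip already handles the reversed interpretation:

flipped_actual = [abs(x - 1) for x in actual]  # 1 <-> 0
scaled_scores = [x / max(prediction_scores_v2) for x in prediction_scores_v2]  # into [0, 1], order unchanged
fpr, tpr, _ = roc_curve(flipped_actual, scaled_scores, pos_label=1)
auc(fpr, tpr)
# should give the same 0.8125 as @Tchotchke's version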

Still, in the end, @BrenBarn seems right. If possible, have an in-depth look at how these values are created and/or used in the other prediction tool.

– serv-inc