Use case: selecting the "optimal threshold" for a logistic model built with statsmodels' Logit to predict, say, binary classes (or multinomial, but with integer classes).
Is there anything built into Python to select the threshold for a (say, logistic) model? For small data sets I remember optimizing the threshold by picking the one that maximizes the buckets of correctly predicted labels (true "0" and true "1"), best seen in the graph here: http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
I also know intuitively that if I set alpha values, that should give me a threshold I can use below. How should I compute the threshold given a reduced model whose variables are all significant at 95% confidence? Obviously setting the threshold > 0.5 -> "1" would be too naive, and since I am looking at 95% confidence the threshold should be smaller, meaning p > 0.2 or so.
This would then mean something like a range of "critical values": the label should be "1" if the predicted probability exceeds the threshold and "0" otherwise.
What I want is something like this:
import numpy as np
import statsmodels.api as sm
from scipy.stats import ks_2samp

model = sm.Logit(y_train, x_train, missing='drop').fit()
threshold = 0.2
# model.predict(x_train) gives the continuous probability of class "1", so to
# turn it into labels I compare it against a threshold (use x_test when
# testing the model)
y_predicted_train = np.array(model.predict(x_train) > threshold, dtype=float)
# 2x2 table of true labels vs predicted labels
table = np.histogram2d(y_train, y_predicted_train, bins=2)[0]
# will do the same on the "test" data

# crude way of selecting an optimal threshold
ks_2samp(y_train, y_predicted_train)
# (0.39963996399639962, 0.958989)
# y_train holds the REAL labels and y_predicted_train the predictions on the
# TRAIN set (already thresholded into binary labels as above); I keep
# modifying the threshold until I fail to reject the null at 95%, i.e. until
# the predicted labels look distributionally like the real ones
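To make that "crude way" concrete, here is a sketch of the brute-force sweep I have in mind (assuming model, x_train and y_train from the snippet above, with y_train as a numpy array of 0s and 1s):

import numpy as np
from scipy.stats import ks_2samp

probs = model.predict(x_train)  # continuous scores in [0, 1]
# the KS statistic between the scores of the two true classes is the maximum
# separation the model can achieve, as in the Wikipedia graph above
d, p = ks_2samp(probs[y_train == 1], probs[y_train == 0])
print("max separation between classes (KS statistic):", d)

# brute force: the threshold maximizing correctly labelled "0"s and "1"s
best_t, best_acc = 0.5, 0.0
for t in np.linspace(0.01, 0.99, 99):  # candidate thresholds
    acc = np.mean((probs > t).astype(float) == y_train)
    if acc > best_acc:
        best_t, best_acc = t, acc
print("threshold with fewest misclassified train labels:", best_t)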
Questions:
1. How can I select the threshold in an objective way, i.e. by reducing the percentage of misclassified labels? Say I care more about missing a true "1" (a false negative) than about mispredicting a "0" as a "1" (a false positive), and want to reduce that error. This I get from the ROC curve. The roc_curve function (in scikit-learn, as far as I can tell, rather than statsmodels) seems to assume that I have already done the labelling for the y_predicted class and that I am just revalidating this over test (point me if my understanding is incorrect; see the first sketch after this list). I also think the confusion matrix alone will not solve the problem of picking the threshold.
2. Which brings me to: how should I consume the output of these built-in functions (oob, confusion_matrix) to select the optimal threshold, first on the train sample and then fine-tuned over the test and cross-validation samples? (See the second sketch after this list.)
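For question 1, this is roughly the kind of ROC-based selection I have in mind; a minimal sketch using scikit-learn's roc_curve (which, I note, takes the continuous scores rather than pre-thresholded labels). The fnr_weight knob is a made-up parameter of mine for penalizing missed "1"s more heavily:

import numpy as np
from sklearn.metrics import roc_curve

probs = model.predict(x_train)  # continuous scores from the fit above
# roc_curve wants the continuous scores, not binary labels
fpr, tpr, thresholds = roc_curve(y_train, probs)
fnr_weight = 3.0                     # made up: missing a "1" costs 3x a false alarm
cost = fnr_weight * (1 - tpr) + fpr  # weighted misclassification cost per cutoff
best = np.argmin(cost)
print("chosen threshold:", thresholds[best], "TPR:", tpr[best], "FPR:", fpr[best])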
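And for question 2, the consumption pattern I imagine: compute a confusion matrix at each candidate threshold on train, pick the cheapest one under some cost, then re-check that choice on the test / cross-validation sample. Again only a sketch; the fn_cost / fp_cost weights are placeholders of mine:

import numpy as np
from sklearn.metrics import confusion_matrix

def pick_threshold(y_true, scores, fn_cost=3.0, fp_cost=1.0):
    """Return the threshold minimizing a weighted confusion-matrix cost."""
    best_t, best_cost = 0.5, np.inf
    for t in np.linspace(0.01, 0.99, 99):
        labels = (scores > t).astype(float)
        # ravel order for labels (0, 1) is tn, fp, fn, tp
        tn, fp, fn, tp = confusion_matrix(y_true, labels).ravel()
        c = fn_cost * fn + fp_cost * fp
        if c < best_cost:
            best_t, best_cost = t, c
    return best_t

t = pick_threshold(y_train, model.predict(x_train))
# then fine-tune / validate the choice out of sample:
y_predicted_test = (model.predict(x_test) > t).astype(float)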
I also looked up the official documentation of the K-S test in SciPy here: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
Related: Statistics Tests (Kolmogorov and T-test) with Python and Rpy2