Seaborn Regplot and Scikit-Learn Logistic Models Calculated Differently?

Question

I'm using both the Scikit-Learn and Seaborn logistic regression functions -- the former for extracting model info (i.e. log-odds, parameters, etc.) and the later for plotting the resulting sigmoidal curve fit to the probability estimations.

Maybe my intuition is incorrect for how to interpret this plot, but I don't seem to be getting results as I'd expect:

#Build and visualize a simple logistic regression
ap_X = ap[['TOEFL Score']].values 
ap_y = ap['Chance of Admit'].values

ap_lr = LogisticRegression()
ap_lr.fit(ap_X, ap_y)

def ap_log_regplot(ap_X, ap_y):
    plt.figure(figsize=(15,10))
    sns.regplot(ap_X, ap_y, logistic=True, color='green')
    return None

ap_log_regplot(ap_X, ap_y)
plt.xlabel('TOEFL Score')
plt.ylabel('Probability')
plt.title('Logistic Regression: Probability of High Chance by TOEFL Score')
plt.show

Seems alright, but then I attempt to use the predict_proba function in Scikit-Learn to find the probabilities of Chance to Admit given some arbitrary value for TOEFL Score (in this case 108, 104, and 112):

eight = ap_lr.predict_proba(108)[:, 1]
four = ap_lr.predict_proba(104)[:, 1]
twelve = ap_lr.predict_proba(112)[:, 1]
print(eight, four, twelve)

Where I get:

[0.49939019] [0.44665597] [0.55213799]

To me, this seems to indicate that a TOEFL Score of 112 gives an individual a 55% chance of being admitted based on this data set. If I were to extend a vertical line from 112 on the x-axis to the sigmoid curve, I'd expect the intersection at around .90.

Am I interpreting/modeling this correctly? I realize that I'm using two different packages to calculate the model coefficients but with another model using a different data set, I seem to get correct predictions that fit the logistic curve.

Any ideas or am I completely modeling/interpreting this inaccurately?

you should use train test split. then train with train set and predict with test set. then find accuracy score — Nihal, Aug 28 '18 at 07:55

score 1 · Answer 1 · answered Aug 28 '18 at 07:57

1

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

y_pred = logreg.predict(x_test)
print('log: ', metrics.accuracy_score(y_test, y_pred))

you can easily find model accuracy like this and decide which model you can use for your application data.

answered Aug 28 '18 at 07:57

Nihal

5,262
7
23
41

My question wasn't really about prediction accuracy but rather how the coefficient estimations between Scikit-Learn and Statsmodels differ in the context of logistic regression. I believe I found the answer in Cross-Validated (see below). – John Sukup Aug 28 '18 at 18:16

score 1 · Accepted Answer · answered Aug 28 '18 at 18:22

After some searching, Cross-Validated provided the correct answer to my question. Although it already exists on Cross-Validated, I wanted to provide this answer on Stack Overflow as well.

Simply put, Scikit-Learn automatically adds a regularization penalty to the logistic model that shrinks the coefficients. Statsmodels does not add this penalty. There is apparently no way to turn this off so one has to set the C= parameter within the LogisticRegression instantiation to some arbitrarily high value like C=1e9.

After trying this and comparing the Scikit-Learn predict_proba() to the sigmoidal graph produced by regplot (which uses statsmodels for its calculation), the probability estimates align.

Link to full post: https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels

Seaborn Regplot and Scikit-Learn Logistic Models Calculated Differently?

2 Answers2