
I'm getting completely different results when I run a logistic regression in R versus Python on the same data: the intercepts and coefficients are very different from each other.

I've seen this problem posted here before, but the explanation in that case was that the X and Y variables exhibited perfect separation. In my data there is no perfect separation.

Here's a reproducible example in R:

x_examp <- c(1,4,7,9,13,17,22,25,29,30,35,40,44,47,50)
y_examp <- c(1,1,1,1,0,1,0,0,1,0,0,0,0,0,0)
mod = glm(y_examp ~ x_examp, family = 'binomial')
summary(mod)

which gives these coefficients (the estimates):

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  3.15324    1.75197   1.800   0.0719 .
x_examp     -0.16534    0.07996  -2.068   0.0387 *

And here's a logistic regression in Python using the same data:

import numpy as np

x_examp = np.array([1,4,7,9,13,17,22,25,29,30,35,40,44,47,50])
x_examp = x_examp.reshape(-1, 1)
y_examp = np.array([1,1,1,1,0,1,0,0,1,0,0,0,0,0,0])

from sklearn.linear_model import LogisticRegression

LR = LogisticRegression()
LR.fit(x_examp, y_examp)

print('intercept:', LR.intercept_)
print('coefficient:', LR.coef_[0])

which returns:

intercept: [ 1.11232593]
coefficient: [-0.08579351]

Given that the standard errors are calculated from the predicted values, which in turn depend on the coefficients, the standard errors will also differ from those in R, and the z-statistics and corresponding p-values will differ as well.

Clearly the results are very different. Does anyone know why this is the case, and which one is correct?

pd441
  • Set `C` to a huge value to get the same fit. – hellpanderr Oct 10 '18 at 15:10
  • @hellpanderr thanks. could you elaborate? Where is `C` set? – pd441 Oct 10 '18 at 15:13
  • 2
    Regularization constant in sklearn classifier – hellpanderr Oct 10 '18 at 15:15
  • Thanks! now they are the same – pd441 Oct 10 '18 at 15:16
  • Not only does sklearn assume all regressions are regularized (see lasso), but last I checked, it didn't even standardize variables correctly. It makes a lot of assumptions and choices on behalf of its users, which is unfortunate because many lack the stats or ML experience to know what those are. It would seem to violate John Chambers' _prime directive_ in Software for Data Analysis... – Justin Oct 10 '18 at 17:48
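To make the comments above concrete: sklearn's `LogisticRegression` applies an L2 penalty by default (`C=1.0`, where `C` is the *inverse* regularization strength), while R's `glm(..., family = 'binomial')` fits an unpenalized maximum-likelihood model. A minimal sketch of the fix, setting `C` to a huge value so the penalty becomes negligible:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

x_examp = np.array([1,4,7,9,13,17,22,25,29,30,35,40,44,47,50]).reshape(-1, 1)
y_examp = np.array([1,1,1,1,0,1,0,0,1,0,0,0,0,0,0])

# C is the inverse of the regularization strength; a huge C
# effectively disables the L2 penalty, recovering the plain MLE
# that R's glm computes.
LR = LogisticRegression(C=1e9)
LR.fit(x_examp, y_examp)

print('intercept:', LR.intercept_)    # close to R's 3.15324
print('coefficient:', LR.coef_[0])    # close to R's -0.16534
```

In recent sklearn versions you can instead pass `penalty=None` (or `penalty='none'` in older releases) to drop the penalty explicitly rather than approximating it with a large `C`.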

0 Answers