
I was running logistic regression on the iris dataset in both R and Python, but the two give different results (coefficients, intercept, and scores).

Python code:
    In[23]: iris_df.head(5)
    Out[23]: 
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
    0           5.1          3.5           1.4          0.2        0
    1           4.9          3.0           1.4          0.2        0
    2           4.7          3.2           1.3          0.2        0
    3           4.6          3.1           1.5          0.2        0
    In[35]: iris_df.shape
    Out[35]: (100, 5)
    #looking at the levels of the Species dependent variable..

        In[25]: iris_df['Species'].unique()
        Out[25]: array([0, 1], dtype=int64)

    #imports used below..
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    #creating dependent and independent variable datasets..
    x = iris_df.iloc[:, 0:4]   # the four feature columns
    y = iris_df.iloc[:, -1]    # the 0/1 Species column

    #modelling starts..
    y = np.ravel(y)
    logistic = LogisticRegression()
    model = logistic.fit(x, y)
    #getting the model coefficients..
    model_coef= pd.DataFrame(list(zip(x.columns, np.transpose(model.coef_))))
    model_intercept = model.intercept_
    In[30]: model_coef
    Out[36]: 
                  0                  1
    0  Sepal.Length  [-0.402473917528]
    1   Sepal.Width   [-1.46382924771]
    2  Petal.Length    [2.23785647964]
    3   Petal.Width     [1.0000929404]
    In[31]: model_intercept
    Out[31]: array([-0.25906453])
    #scores...
    In[34]: logistic.predict_proba(x)
    Out[34]: 
    array([[ 0.9837306 ,  0.0162694 ],
           [ 0.96407227,  0.03592773],
           [ 0.97647105,  0.02352895],
           [ 0.95654126,  0.04345874],
           [ 0.98534488,  0.01465512],
           [ 0.98086592,  0.01913408],
           ...

R code:

> str(irisdf)
'data.frame':   100 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : int  0 0 0 0 0 0 0 0 0 0 ...

 > model <- glm(Species ~ ., data = irisdf, family = binomial)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-1.681e-05  -2.110e-08   0.000e+00   2.110e-08   2.006e-05  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)       6.556 601950.324       0        1
Sepal.Length     -9.879 194223.245       0        1
Sepal.Width      -7.418  92924.451       0        1
Petal.Length     19.054 144515.981       0        1
Petal.Width      25.033 216058.936       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 1.3166e-09  on 95  degrees of freedom
AIC: 10

Number of Fisher Scoring iterations: 25

Because of the convergence problem, I increased the maximum number of iterations and set epsilon to 0.01:

> model <- glm(Species ~ ., data = irisdf, family = binomial,control = glm.control(epsilon=0.01,trace=FALSE,maxit = 100))
> summary(model)

Call:
glm(formula = Species ~ ., family = binomial, data = irisdf, 
    control = glm.control(epsilon = 0.01, trace = FALSE, maxit = 100))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-0.0102793  -0.0005659  -0.0000052   0.0001438   0.0112531  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)     1.796    704.352   0.003    0.998
Sepal.Length   -3.426    215.912  -0.016    0.987
Sepal.Width    -4.208    123.513  -0.034    0.973
Petal.Length    7.615    159.478   0.048    0.962
Petal.Width    11.835    285.938   0.041    0.967

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.3863e+02  on 99  degrees of freedom
Residual deviance: 5.3910e-04  on 95  degrees of freedom
AIC: 10.001

Number of Fisher Scoring iterations: 12

#R scores..
> scores = predict(model, newdata = irisdf, type = "response")
> head(scores,5)
           1            2            3            4            5 
2.844996e-08 4.627411e-07 1.848093e-07 1.818231e-06 2.631029e-08 

The scores, intercept, and coefficients are completely different in R and Python. Which one is correct? I want to proceed in Python, but I am now confused about which results are accurate.

Aby Mathew
  • One possible reason I can think of could be different unconstrained optimization method used by R and Sklearn for MLE; which may cause the log-likelihood function to end up at different local optima. – abhiieor Jun 17 '16 at 05:18
  • you may also want to follow development on a similar question http://stackoverflow.com/questions/37872536/why-the-auc-is-so-different-from-logistic-regression-of-sklearn-and-r – abhiieor Jun 17 '16 at 05:21

1 Answer


UPDATED The problem is that there is perfect separation along the petal width variable. In other words, this variable alone can perfectly predict whether a sample in the given dataset is setosa or versicolor. This breaks the maximum-likelihood estimation used by logistic regression in R: the log-likelihood can be driven arbitrarily high by sending the coefficient of petal width to infinity.
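
To see the separation concretely, here is a quick pandas check (my own sketch, not part of the original question); it rebuilds the setosa/versicolor subset from sklearn's bundled iris data rather than the asker's file, using the question's column names:

    import pandas as pd
    from sklearn.datasets import load_iris

    # Rebuild a frame like the one in the question: setosa (0) vs versicolor (1) only.
    data = load_iris()
    iris_df = pd.DataFrame(data.data, columns=['Sepal.Length', 'Sepal.Width',
                                               'Petal.Length', 'Petal.Width'])
    iris_df['Species'] = data.target
    iris_df = iris_df[iris_df['Species'].isin([0, 1])]

    # If the largest Petal.Width among the 0s is below the smallest among the 1s,
    # a single threshold separates the two classes perfectly, and the unpenalized
    # MLE pushes that coefficient toward infinity.
    ranges = iris_df.groupby('Species')['Petal.Width'].agg(['min', 'max'])
    print(ranges)
    print("Perfectly separated:", ranges.loc[0, 'max'] < ranges.loc[1, 'min'])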

Some background and strategies are discussed here.

There is also a good thread on CrossValidated discussing strategies.

So why does sklearn's LogisticRegression work? Because it fits regularized logistic regression: by default it adds an L2 penalty (inverse strength C=1.0), which penalizes large parameter values and keeps the coefficient estimates finite.
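
As a rough illustration (mine, not the original answer's), the inverse regularization strength C controls this: the question's sklearn output corresponds to the default C=1.0, and as C grows the penalty weakens and the coefficients drift toward the diverging unpenalized solution that glm() is chasing. Reusing the x and y built in the question:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # x, y: the feature frame and 0/1 Species target from the question.
    # The L2 penalty keeps the fit finite even though the classes are separable;
    # larger C means a weaker penalty and larger coefficients.
    for C in (1.0, 100.0, 1e6):
        model = LogisticRegression(C=C, max_iter=1000).fit(x, y)
        print("C =", C,
              "coef =", np.round(model.coef_, 2),
              "intercept =", np.round(model.intercept_, 2))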

In the example below, I use logistf, an R package implementing Firth's bias-reduced logistic regression, to produce a converged model.

library(logistf)

irisdf <- read.table("path_to_iris.txt", sep = "\t", header = TRUE)
irisdf$Species <- as.factor(irisdf$Species)
sapply(irisdf, class)   # check the column classes

model1 <- glm(Species ~ ., data = irisdf, family = binomial)
# Does not converge, throws warnings.

model2 <- logistf(Species ~ ., data = irisdf)
# Does converge (logistf fits a logistic model, so no family argument is needed).

ORIGINAL Based on the std. errors and z-values in the R output, I think you have a bad model specification. z-values close to 0 (with p-values near 1) essentially tell you that none of the coefficient estimates is distinguishable from zero, so this is a nonsensical model.

My first thought is that you need to transform that Species field into a categorical variable; it is an int type in your example. Try using `as.factor`:

How to convert integer into categorical data in R?

andrew
  • I tried converting the dependent variable into a factor, but the results are the same. – Aby Mathew Jun 17 '16 at 05:41
  • Strange. Which iris data set are you using? I can take a look at it today. – andrew Jun 17 '16 at 17:52
  • I will send the data to you. Can you share your mail? I converted setosa to 0 and versicolor to 1, and removed the third species to keep two levels. – Aby Mathew Jun 20 '16 at 06:08
  • I'd rather not post my email on a public website :-) How about you make a Gist I can download? https://gist.github.com/ – andrew Jun 20 '16 at 19:55
  • Ahh... I should have seen the problem earlier. You have perfect / quasi-perfect separation on the petal width variable. I will edit my answer to address this. – andrew Jun 21 '16 at 18:35
  • By the way, I should have caught that without seeing the data. The hint was in the warning message: "fitted probabilities numerically 0 or 1 occurred". – andrew Jun 21 '16 at 18:51
  • The other clue is that the `Petal.Width` coefficient estimate is large (anything >10 in a logistic regression is a clue that perfect separation is probably happening ...) – Ben Bolker Jun 21 '16 at 18:57
  • Thanks for the answer. I tried logistic regression after changing the values of petal width. The dataset is at https://gist.github.com/abyalias/deb07d4b6c66ca652e8eac5121f0ce35, and the R (logistf package) and Python fits are at https://gist.github.com/abyalias/e51d059569f91b5920f60803e2cf34cd, but the coefficients and intercept are still different. – Aby Mathew Jun 23 '16 at 06:03
  • The two methods produce different results because they maximize different formulations of the penalized likelihood function (see the sketch after these comments). If the term "penalized likelihood" does not mean anything to you, you might do some research on regression shrinkage methods. Chapter 6.2 of ISLR gives a nice intro to ridge and lasso regression. They are not presented for logistic regression, but you should get the idea. – andrew Jun 23 '16 at 18:39
  • Thanks andrew, now I have some idea; I need to go through it in detail. – Aby Mathew Jun 24 '16 at 05:07
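
To make the point in that last comment concrete: sklearn's default fit maximizes the L2-penalized log-likelihood $\ell(\beta) - \frac{1}{2C}\lVert\beta\rVert_2^2$ (the intercept is left unpenalized), while logistf's Firth method maximizes $\ell(\beta) + \tfrac{1}{2}\log\lvert I(\beta)\rvert$, where $\ell(\beta)$ is the binomial log-likelihood and $I(\beta)$ is the Fisher information matrix. These are the standard formulations rather than anything stated in the thread; because the penalties differ, the two fits shrink the coefficients toward different finite values, so matching estimates between them should not be expected.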