
I'm doing logistic regression in Python, using this example from Wikipedia: link to example

Here's the code I have:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5], [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]] # number of hours spent studying
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1] # 0=failed, 1=pass

lr.fit(Z,y)

The results for this are:

lr.coef_
array([[ 0.61126347]])

lr.intercept_
array([-1.36550178])

while they get 1.5046 for the hours coefficient and -4.0777 for the intercept. Why are the results so different? Their prediction for 1 hour of study is a probability of 0.07 of passing, while I get 0.32 with this model. These are drastically different results.
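
For reference, the two probabilities come from plugging each model's coefficient and intercept into the logistic function; here is the quick sanity check I used:

import math

def p_pass(hours, coef, intercept):
    # Logistic function applied to the linear score coef*hours + intercept
    return 1.0 / (1.0 + math.exp(-(coef * hours + intercept)))

print(p_pass(1.0, 0.61126347, -1.36550178))  # about 0.32 with my fitted model
print(p_pass(1.0, 1.5046, -4.0777))          # about 0.07 with Wikipedia's coefficients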

DoctorEvil
  • I think the problem might be with the actual formula that was used for the calculation. Please give the result of a sample prediction; I'll try to test it. – Kishor Nov 12 '17 at 12:33

2 Answers

4

The "problem" is that LogisticRegression in scikit-learn uses L2-regularization (aka Tikhonov regularization, aka Ridge, aka normal prior). Please read sklearn user guide about logistic regression for implementational details.

In practice, it means that LogisticRegression has a parameter C, which equals 1 by default. The smaller C is, the stronger the regularization: coef_ gets smaller and intercept_ gets larger, which increases numerical stability and reduces overfitting.

If you set C to a very large value, the effect of regularization will vanish. With

lr = LogisticRegression(C=100500000)

you get coef_ and intercept_ respectively

[[ 1.50464535]]
[-4.07771322]

just like in the Wikipedia article.
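
If you want to check the prediction that the question mentions, you can call predict_proba on the (effectively) unregularized model; with the Wikipedia-like coefficients it should give roughly 0.07 for one hour of study (exact decimals may differ slightly across sklearn versions):

from sklearn.linear_model import LogisticRegression

# Same data as in the question
Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5],
     [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]]
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

# Effectively unregularized fit
lr = LogisticRegression(C=100500000).fit(Z, y)

# Second column is the probability of class 1 (pass): about 0.07 for one hour,
# versus roughly 0.32 with the default C=1
print(lr.predict_proba([[1.0]])[0, 1])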

Some more theory. Overfitting is a problem that arises when there are many features but not many examples. A simple rule of thumb: use a small C if n_obs/n_features is less than 10. In the wiki example there is one feature and 20 observations, so simple logistic regression would not overfit even with a large C.

Another use case for a small C is convergence problems. They can happen if the positive and negative examples can be perfectly separated, or in the case of multicollinearity (which, again, is more likely when n_obs/n_features is small), and they lead to infinite growth of the coefficients in the non-regularized case.
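
To see what that looks like, here is a small toy sketch (made-up data, not from the question) in which the two classes are perfectly separable; the fitted coefficient keeps growing as C increases and the regularization fades:

from sklearn.linear_model import LogisticRegression

# Made-up, perfectly separable toy data (not from the question):
# everything below 3 hours fails, everything above passes
X = [[1.0], [1.5], [2.0], [4.0], [4.5], [5.0]]
y = [0, 0, 0, 1, 1, 1]

# As C grows the penalty fades and the fitted coefficient keeps increasing;
# with a very large C the solver may also warn about convergence
for C in [1.0, 100.0, 10000.0, 1e8]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(f"C={C:g}  coef={clf.coef_[0][0]:.3f}  intercept={clf.intercept_[0]:.3f}")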

David Dale
  • If I got this right, for a model with lots of training examples and variance (very similar inputs could easily produce different results), a small C would be better to prevent overfitting? This example from the wiki seems simplistic and the data is "as expected", so overfitting isn't a problem here. – DoctorEvil Nov 13 '17 at 10:44
0

I think the problem is arising from the fact that you have

Z = [[0.5], [0.75], [1.0], [1.25], [1.5], [1.75], [1.75], [2.0], [2.25], [2.5], [2.75], [3.0], [3.25], [3.5], [4.0], [4.25], [4.5], [4.75], [5.0], [5.5]]

but instead it should be

Z = [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25 ...]

Try this

Moulick
  • Then I get: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. – DoctorEvil Nov 12 '17 at 12:45
  • @DoctorEvil I was wrong, please refer to David Dale's answer – Moulick Nov 12 '17 at 12:48