
I'm running Logistic Regression on a very small, simple, well-separable dataset, but the model cannot find the optimal decision boundary. Where is my mistake?

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

sm_df = pd.DataFrame()
sm_df['x'] = [0.5,4.0,1.0,2.5,2.0,3.5,1.0,3.0, 1.0, 2.0]
sm_df['y'] = [1.0,3.5,1.0,3.5,1.0, 4.5, 2.0,3.0, 0.0, 2.5]
sm_df['Bad_data'] = [True, False, True, False, True, False, True, False, True, False]

log = linear_model.LogisticRegression()

log.fit(sm_df[['x','y']], sm_df['Bad_data'])
test_score = log.score(sm_df[['x','y']], sm_df['Bad_data'])
print("test score: ", test_score)

# Scatter plot of the dataframe
sns.lmplot(x='x',              # Horizontal axis
           y='y',              # Vertical axis
           data=sm_df,         # Data source
           fit_reg=False,      # Don't fit a regression line
           hue="Bad_data",     # Color points by class
           scatter_kws={"marker": "D",  # Marker style
                        "s": 100})      # Marker size

plt.xlabel('x')
plt.ylabel('y')

# To plot the decision boundary: w0 + w1*x + w2*y = 0  =>  y = -(w0 + w1*x) / w2
w0 = log.intercept_[0]
w1, w2 = log.coef_[0]

X = np.array([0, 4])
x2 = np.array([-w0/w2, -w0/w2 - w1*4/w2])
plt.plot(X, x2)

# Predict and plot a single test point
t_x = [1.5]
t_y = [1.8]
pr = log.predict([[1.5, 1.8]])  # predict expects a 2D array of samples
plt.scatter(t_x,                # Horizontal axis
            t_y, c='r')         # Red marker for the test point
plt.annotate(pr, (1.5, 1.9))
plt.show()

my plot: [scatter of the two classes with the fitted decision boundary, which does not separate them correctly]

  • You can change the [default solver](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) from `'liblinear'` to others which will give perfect results on this data. `log = linear_model.LogisticRegression(solver='newton-cg')` – Vivek Kumar Sep 20 '17 at 05:52
  • For implementation reasons, the default solver `'liblinear'` penalizes the intercept, which is not advisable. All the other solvers do not penalize the intercept and should give you the correct boundary. – TomDLT Sep 20 '17 at 12:39
  • Interesting. Thanks @TomDLT . For a moment i tried to grasp the implications of the comment above together with the answer. – sascha Sep 20 '17 at 12:41
  • Thank you, TomDLT. Good to know that the intercept can be penalized. – Ekaterina Tcareva Sep 20 '17 at 17:26
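
A minimal sketch of the solver swap suggested in these comments (it rebuilds the question's sm_df so it runs on its own; the "perfect result" is what the comments predict, not something verified here):

import pandas as pd
from sklearn import linear_model

sm_df = pd.DataFrame({
    'x': [0.5, 4.0, 1.0, 2.5, 2.0, 3.5, 1.0, 3.0, 1.0, 2.0],
    'y': [1.0, 3.5, 1.0, 3.5, 1.0, 4.5, 2.0, 3.0, 0.0, 2.5],
    'Bad_data': [True, False, True, False, True, False, True, False, True, False],
})

# 'newton-cg' does not penalize the intercept, unlike the default 'liblinear'
log = linear_model.LogisticRegression(solver='newton-cg')
log.fit(sm_df[['x', 'y']], sm_df['Bad_data'])
print("training score:", log.score(sm_df[['x', 'y']], sm_df['Bad_data']))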

1 Answer


The reason is that misclassification error is not the only thing the model is penalized for; there is also a regularization term. If you make the regularization term smaller with something like

log = linear_model.LogisticRegression(C=10.)

then all points will be classified correctly in this example. That's because the model will then care relatively more about classifying the points correctly and relatively less about regularization. Here the argument C is the inverse of the regularization strength, and is 1 by default.
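
For reference, a minimal check of that claim, reusing sm_df and the imports from the question:

log = linear_model.LogisticRegression(C=10.)
log.fit(sm_df[['x', 'y']], sm_df['Bad_data'])
# per the paragraph above, the training score should now be 1.0 on these ten points
print("training score:", log.score(sm_df[['x', 'y']], sm_df['Bad_data']))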

Part of why this is necessary here is that your data is not standardized. If you standardize the data before applying the logistic regression (give x and y zero mean and variance of 1), then you also get a perfect fit with C=1. You can do this with something like

sm_df['x'] = (sm_df['x'] - sm_df['x'].mean()) / sm_df['x'].std()
sm_df['y'] = (sm_df['y'] - sm_df['y'].mean()) / sm_df['y'].std()
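
The same idea sketched with scikit-learn's StandardScaler in a pipeline instead of standardizing the columns by hand (applied to the original, unscaled sm_df; this keeps the scaling bundled with the model):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), linear_model.LogisticRegression())  # default C=1
pipe.fit(sm_df[['x', 'y']], sm_df['Bad_data'])
print("training score:", pipe.score(sm_df[['x', 'y']], sm_df['Bad_data']))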
Jeremy McGibbon
  • Thank you, Jeremy! It works. It looks like when we have a well-separable dataset, we should care less about regularization. – Ekaterina Tcareva Sep 20 '17 at 17:23
  • Sort of! Part of why you have to increase C is because your data isn't standardized, so the model needs large coefficients to fit the data (regularization discourages large coefficients). I updated my answer to reflect this. Another reason is that the purpose of regularization is to prevent over-fitting your data, which basically means not trying too hard to get all the training data right (but then it may generalize better on unseen data). – Jeremy McGibbon Sep 20 '17 at 18:00