
I am currently using sklearn's Logistic Regression function to work on a synthetic 2D problem. The dataset is shown below:

Binary label, 2D dataset

I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):

The code for this is only two lines:

model = LogisticRegression()
model.fit(tr_data, tr_labels)

I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?).

It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, since the data is (mostly) located in the top right and bottom left, and from testing a few things it seems that a few stray datapoints are what's causing the boundaries to behave like this.
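For reference, this is roughly how I tested that (just a sketch; the outlier indices are made up for illustration, and it assumes tr_data and tr_labels are NumPy arrays):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical mask over the few suspected stray datapoints (indices picked by eye)
outlier_mask = np.zeros(len(tr_data), dtype=bool)
outlier_mask[[12, 345, 678]] = True

# Fit on the full data and on the data with the stray points dropped
full_model = LogisticRegression().fit(tr_data, tr_labels)
clean_model = LogisticRegression().fit(tr_data[~outlier_mask], tr_labels[~outlier_mask])

# Compare the learned boundaries (coef_ . x + intercept_ = 0)
print("with outliers:   ", full_model.coef_, full_model.intercept_)
print("without outliers:", clean_model.coef_, clean_model.intercept_)

The two fits come out quite different, which is why I suspect the stray points.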

For example, here's another dataset and its boundaries:

2D dataset without stray datapoints

boundaries

Would anyone know what might be causing this? From my understanding, logistic regression shouldn't be this sensitive to outliers.

2 Answers


Your model is overfitting the data (the decision regions it found do indeed perform better on the training set than the diagonal line you would expect).

The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter into the probability computation. The unregularized algorithm can use very large weights to make the decision region very sharp, so in your example it finds an optimal solution in which (some of) the outliers are classified correctly.

With stronger regularization you prevent that, and the distances play a bigger role. Try different values for the inverse regularization strength C, e.g.

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1)  # smaller C means stronger regularization
model.fit(tr_data, tr_labels)

Note: the default value C=1.0 already corresponds to a regularized version of logistic regression.
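As a quick check of this (only a sketch, reusing your tr_data and tr_labels), you can sweep C and look at the size of the learned weights: the weaker the regularization, the larger the weights get and the sharper the sigmoid becomes.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Larger C = weaker regularization = larger weights = sharper decision regions
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = LogisticRegression(C=C).fit(tr_data, tr_labels)
    print(f"C={C:>6}: ||w|| = {np.linalg.norm(model.coef_):.3f}, "
          f"train accuracy = {model.score(tr_data, tr_labels):.3f}")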

jan
  • Hi, I see why the model might be overfitting without the regularizer, but I don't get why this boundary is optimal. Since there are only a few outliers out of the thousands of datapoints, wouldn't this boundary hurt the loss function more? Even though it predicts the few outliers correctly, isn't it losing out by having the boundary closer to all the other datapoints? Or maybe my understanding of logistic regression isn't correct? – Pian Pawakapan Apr 02 '18 at 02:15
  • Hi, the loss function is optimal (0) when all your data is classified correctly. It doesn't care about distances; only the predicted and true labels enter. So in your particular case, the algorithm found that this boundary has a better loss. – jan Apr 02 '18 at 07:53
  • 1
    Ok, maybe that was a bit simplified: The loss is optimal when all the data is classified correctly with probability 1. The distances do enter in the probability computation. The unregularized algorithm can however use very large weights to make the decision region very sharp. By regularization you prevent that and the distances play a bigger role. – jan Apr 02 '18 at 08:08
  • 1
    I updated the answer accordingly. I hope it helps :) – jan Apr 02 '18 at 09:14

Let us further qualify why logistic regression overfits here: after all, there are just a few outliers but hundreds of other data points. To see why, it helps to note that the logistic loss is essentially a smoothed version of the hinge loss (used in SVMs).

An SVM does not 'care' about samples on the correct side of the margin at all: as long as they do not cross the margin, they incur zero cost. Since logistic regression is a smoothed version of the SVM, far-away samples do incur a cost, but it is negligible compared to the cost incurred by samples near the decision boundary.

So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have a disproportionately large impact on the solution compared to far-away samples.
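To make that concrete, here is a small standalone sketch comparing the two per-sample losses as a function of the signed margin m = y * (w^T x + b); nothing here is specific to the data in the question.

import numpy as np

def logistic_loss(m):
    # log(1 + exp(-m)): the per-sample loss minimized by logistic regression
    return np.log1p(np.exp(-m))

def hinge_loss(m):
    # max(0, 1 - m): the per-sample loss used by a linear SVM
    return np.maximum(0.0, 1.0 - m)

# Negative margin = misclassified, small positive = near the boundary,
# large positive = far on the correct side
for m in [-2.0, 0.0, 1.0, 5.0, 20.0]:
    print(f"m = {m:>5}: logistic = {logistic_loss(m):.2e}, hinge = {hinge_loss(m):.2f}")

At m = 20 the logistic loss is already of the order of 1e-9, so a correctly classified far-away sample contributes essentially nothing compared to a sample sitting near the boundary.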

appletree