
I am trying to implement logistic regression for a binary classification problem from scratch in Python. My results do not match those produced by sklearn's implementation, as you can see in this example. Note that the two decision lines look "similar", but they are clearly not the same.

I took care of what is mentioned in this answer: both sklearn and I (i) fit the intercept term and (ii) do not apply regularization (penalty='none'). Also, while sklearn trains for 100 iterations by default, I am running 10000 with a rather small learning rate (0.1 in the code below). I tried different combinations of values, but the problem does not seem to depend on this.
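
For reference, the sklearn side of this comparison corresponds to something like the following (a minimal sketch; penalty='none' assumes a scikit-learn version that accepts it, newer releases spell it penalty=None):

from sklearn.linear_model import LogisticRegression

# Sketch of the setup described above: intercept fitted (the default) and no
# regularization; everything else is left at sklearn's defaults.
clf_sketch = LogisticRegression(penalty='none', fit_intercept=True)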

At the same time, I do notice that, even before comparing the results with sklearn, the ones I obtain with my implementation seem to be wrong: the decision regions are clearly off in some cases. You can see an example in this image.

The last point seems to indicate that the problem is all my own fault. Here is my code (it actually generates new datasets at each run and plots the results):

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

def create_training_set():
    X0, y = make_blobs(n_samples=[100, 100],
                       centers=None,
                       n_features=2,
                       cluster_std=1)
    y = y.reshape(-1, 1) # make y a column vector
    return np.hstack([np.ones((X0.shape[0], 1)), X0]), X0, y

def create_test_set(X0):
    xx, yy = np.meshgrid(np.arange(X0[:, 0].min() - 1, X0[:, 0].max() + 1, 0.1),
                         np.arange(X0[:, 1].min() - 1, X0[:, 1].max() + 1, 0.1))
    X_test = np.c_[xx.ravel(), yy.ravel()]
    X_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
    return xx, yy, X_test

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def apply_gradient_descent(theta, X, y, max_iter=1000, alpha=0.1):
    m = X.shape[0]
    cost_iter = []
    for _ in range(max_iter):
        p_hat = sigmoid(np.dot(X, theta))  # predicted probabilities, shape (m, 1)
        cost_J = -1/float(m) * (np.dot(y.T, np.log(p_hat)) + np.dot((1 - y).T, np.log(1 - p_hat)))  # mean cross-entropy
        grad_J = 1/float(m) * np.dot(X.T, p_hat - y)  # gradient of the cost w.r.t. theta
        theta -= alpha * grad_J  # gradient descent update
        cost_iter.append(float(cost_J))
    return theta, cost_iter

fig, ax = plt.subplots(10, 2, figsize = (10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
all_cost_history = []
for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)
    
    theta, cost_evolution = apply_gradient_descent(np.zeros((X_train.shape[1], 1)), X_train, y, max_iter, alpha)   
    all_cost_history.append(cost_evolution)
    
    y_pred = np.where(sigmoid(np.dot(X_test, theta)) > 0.5, 1, 0)
    y_pred = y_pred.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, y_pred, cmap = cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y.ravel(), cmap=cmap_bold, alpha = 1, edgecolor="black")
    
    y = y.reshape(X_train.shape[0], )
    clf = LogisticRegression(penalty='none').fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap = cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha = 1, edgecolor="black")
plt.show()

1 Answer


There is actually a difference between your implementation and sklearn's: you are not using the same optimization algorithm (also called the solver in sklearn), and I think the difference you observe comes from there. You are using gradient descent, while sklearn's implementation uses the "liblinear" solver by default, which is different.

Indeed, different optimization algorithms can yield different results depending on, for example:

  • Convergence speed: since we are limiting the number of iterations, an algorithm that converges more slowly will stop at a different point and thus yield different decision regions.
  • Whether the algorithm is deterministic or not: non-deterministic algorithms (such as stochastic gradient descent) can converge to different solutions on the same dataset, so you can observe different results with the exact same data and algorithm.
  • Hyperparameters: changing a hyperparameter (for example, the learning rate of a gradient descent algorithm) also changes the behavior of the optimization algorithm, leading to different results.

In your case, there are good reasons for not always getting the same results: the gradient descent algorithm you use can get stuck in a local minimum (because of an insufficient number of iterations, a non-optimal learning rate, ...), which can be different from the local minimum reached by the liblinear solver.
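
One way to see this concretely is to compare the fitted parameters directly rather than only the plotted regions. A quick sketch, reusing theta, X0 and y from your code, and assuming a scikit-learn version that accepts penalty='none' (newer releases use penalty=None) so that neither side is regularized:

# Sketch: compare the hand-rolled coefficients with sklearn's side by side.
# Assumes theta (shape (3, 1), bias term first), X0 and y from the question's code.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='none', max_iter=10000).fit(X0, y.ravel())
print("gradient descent: intercept =", theta[0, 0], " weights =", theta[1:, 0])
print("sklearn:          intercept =", clf.intercept_[0], " weights =", clf.coef_[0])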

You can observe the same kind of discrepancies if you compare sklearn's implementation with different solvers (reusing your code):


fig, ax = plt.subplots(10, 2, figsize=(10, 30))
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
max_iter = 10000
alpha = 0.1
solver_algo_1 = 'liblinear'
solver_algo_2 = 'newton-cg'

for n_fil in range(10):
    X_train, X0, y = create_training_set()
    xx, yy, X_test = create_test_set(X0)

    y = y.reshape(X_train.shape[0], )

    clf = LogisticRegression(solver=solver_algo_1, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 0].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 0].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")

    clf = LogisticRegression(solver=solver_algo_2, max_iter=max_iter).fit(X0, y)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax[n_fil, 1].pcolormesh(xx, yy, Z, cmap=cmap_light)
    ax[n_fil, 1].scatter(X0[:, 0], X0[:, 1], c=y, cmap=cmap_bold, alpha=1, edgecolor="black")
plt.show()

As an example, with "liblinear" (left) and "newton-cg" (right), you can get this:

Decision boundaries for logistic regression with the liblinear solver (left) and the newton-cg solver (right)

Though the logistic regression model is the same, the difference in the optimization algorithms leads to different results. So, in a few words, the difference between your implementation and scikit-learn's is the optimization algorithm.

Now, if the quality of the decision boundary you get is not satisfying, you can try tuning the hyperparameters of your gradient descent algorithm, or try changing the optimization algorithm!
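
For the tuning part, a cheap diagnostic is to look at how the cost evolves over the iterations, e.g. to check whether alpha or max_iter is too small for the parameters to have converged. A small sketch, reusing the all_cost_history list that your code already fills but never plots:

# Sketch: plot the cost curves collected in all_cost_history to check convergence.
# A curve that is still clearly decreasing at the last iteration suggests that
# max_iter (or alpha) is too small for theta to have converged.
fig, ax = plt.subplots(figsize=(8, 5))
for i, cost_evolution in enumerate(all_cost_history):
    ax.plot(cost_evolution, label=f"dataset {i}")
ax.set_xlabel("iteration")
ax.set_ylabel("cost J")
ax.legend()
plt.show()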

– A Co
  • Hi, thanks for your answer, very much appreciated. I got the feeling that the solver could be playing a role here. However, with regularization turned off, the loss function is a plain log-loss function, which is convex. Hence, even though different solvers may be more efficient, my feeling was that they would all converge to the same unique and global minimum (at least the deterministic ones, so all except 'sag' and 'saga', I believe). Anyway, I think your answer is the key point to consider here. BTW, the default solver is 'lbfgs', at least in version 0.24.1. Thanks! – Julian Martin Del Fiore Mar 20 '21 at 20:08
  • I agree, there should be no local minima in a logistic regression problem. Now, the data here is (most often) separable, so there actually isn't a minimum: the coefficients can tend to infinity and produce lower and lower losses. And (most often) this dataset has many perfectly separating lines, so which one gets found depends on the solver. But one of the images linked in the question clearly indicates a suboptimal solution for a non-separable instance of the data. So I think there still must be something wrong with the implementation. – Ben Reiniger Mar 22 '21 at 14:40
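
To illustrate the separability point from the last comment: on a perfectly separable dataset the log-loss has no finite minimizer, so gradient descent keeps inflating the coefficients and the boundary you end up with depends on when you stop. A minimal sketch of this behaviour (standalone toy data, same update rule as in the question):

# Sketch: on linearly separable data, ||theta|| grows without bound while the
# training loss keeps shrinking, so there is no unique "converged" solution.
import numpy as np

rng = np.random.default_rng(0)
X0_sep = np.vstack([rng.normal(-3, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])  # two well-separated blobs
y_sep = np.vstack([np.zeros((50, 1)), np.ones((50, 1))])
X_sep = np.hstack([np.ones((100, 1)), X0_sep])  # prepend the bias column, as in the question

theta_sep = np.zeros((3, 1))
alpha, m = 0.1, X_sep.shape[0]
for it in range(1, 100001):
    p_hat = 1 / (1 + np.exp(-X_sep @ theta_sep))
    theta_sep -= alpha / m * X_sep.T @ (p_hat - y_sep)
    if it in (100, 1000, 10000, 100000):
        loss = -np.mean(y_sep * np.log(p_hat) + (1 - y_sep) * np.log(1 - p_hat))
        print(f"iter {it:6d}  ||theta|| = {np.linalg.norm(theta_sep):6.2f}  loss = {loss:.2e}")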