
I'm exploring the scikit-learn logistic regression algorithm. I understand that as part of training, the algorithm fits a regression curve whose y-variable ranges from 0 to 1 (a sigmoid S-curve). The y-variable is treated as continuous here, although in reality it is discrete.

How is the algorithm able to learn the S-curve when the training dataset reflects reality and contains the y-variable only as a discrete variable? There is no probability estimate in the training data, so I'm wondering how the algorithm is able to learn the S-curve.

museshad

2 Answers


> There is no probability estimate in the training

Sure, but we pretend there is for modeling purposes. We want to maximize the probability of, as you call it, “reality”—if the observed response (the discrete value you refer to) is a 0, we want to predict that with probability 1; similarly, if the response is a 1, we want to predict that with probability 1.

Fitting the model to one data point, getting the right answer with probability 1, would be easy. Of course, we have more than one data point, and we have to balance concerns across all of them. We want the predicted value sigmoid(weights * features) to be close to the true response (0 or 1) for every data point, but there may not be a way to set the parameters of the model to achieve this. (That is, the data may not be linearly separable.)
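
Here is a minimal NumPy sketch of that idea (the data and weights are made up for illustration): the model's prediction for each point is sigmoid(weights · features), and training tries to make the joint probability assigned to the observed labels as large as possible.

```python
import numpy as np

def sigmoid(z):
    # Squash a real-valued score into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 points with 2 features each, and their observed 0/1 labels.
X = np.array([[0.5, 1.2], [2.0, 0.3], [-1.0, 0.8], [-2.5, -0.4]])
y = np.array([1, 1, 0, 0])

weights = np.array([1.5, -0.5])  # hypothetical parameter values

p = sigmoid(X @ weights)  # predicted P(y = 1) for each point
# Probability the model assigns to what was actually observed:
prob_of_observed = np.where(y == 1, p, 1 - p)
likelihood = prob_of_observed.prod()
print(p, likelihood)
```

Training searches for the weights that make that likelihood as large as possible; a perfect likelihood of 1 is only attainable if the data are separable.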

Arya McCarthy

Good question! The fitting process in logistic regression is a search procedure that seeks the beta coefficients that minimize the error between the probabilities predicted by the model (continuous values) and the observed data (discrete values).

In logistic regression, you model probabilities using a logistic function (also known as a sigmoid function):

XB = B0 + B1 * X1 + B2 * X2 + ... + BN * XN
p(X) = e^(XB) / (1 + e^(XB))
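
As a quick sanity check, here is that function in plain Python. Note that e^(XB) / (1 + e^(XB)) is algebraically identical to 1 / (1 + e^(-XB)), which is the form usually computed:

```python
import math

def p(xb):
    # Logistic (sigmoid) function: maps any real score XB into (0, 1).
    return 1.0 / (1.0 + math.exp(-xb))

print(p(-3.0), p(0.0), p(3.0))  # ~0.047, 0.5, ~0.953
```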

The algorithm finds the beta coefficients using Maximum Likelihood estimation. In practice this means minimizing a cost function, and for logistic regression the cost function that corresponds to maximum likelihood is the negative log-likelihood, also known as the log-loss or binary cross-entropy:

  -sum_i [ y_i * log(P(X_i)) + (1 - y_i) * log(1 - P(X_i)) ]

Simpler error measures such as sum (P(X_i) - y_i)^2 exist, but maximum likelihood for logistic regression corresponds specifically to the log-loss above (a sketch of the computation follows below).
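
Here is a minimal NumPy sketch of that cost, with hypothetical predicted probabilities; the probabilities are clipped so that log(0) is never evaluated:

```python
import numpy as np

def log_loss(p, y, eps=1e-15):
    # Negative log-likelihood of the observed 0/1 labels y
    # under predicted probabilities p.
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

p = np.array([0.9, 0.8, 0.3, 0.1])  # hypothetical model outputs P(y = 1)
y = np.array([1, 1, 0, 0])          # observed labels
print(log_loss(p, y))               # small, because p matches y well
```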

An initial set of betas is picked at random, the cost is calculated, and the algorithm then searches for a new set of betas that results in a lower cost. It stops searching for new betas when the decrease in cost between iterations is smaller than a given threshold (set by the tol parameter in sklearn). The toy gradient-descent loop below illustrates the idea.
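
This is a deliberately simplified sketch; sklearn's actual solvers are more sophisticated, but the stopping rule driven by a tolerance is the same idea. The data, learning rate, and seed are all made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 points with an intercept column and one feature.
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])

rng = np.random.default_rng(0)
betas = rng.normal(size=2)  # start from a random set of betas
lr, tol = 0.1, 1e-6
prev_cost = np.inf

while True:
    p = sigmoid(X @ betas)
    cost = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    if prev_cost - cost < tol:   # stop when the improvement is tiny
        break
    prev_cost = cost
    betas -= lr * X.T @ (p - y)  # gradient of the log-loss w.r.t. betas

print(betas, cost)
```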

The way the model converges to the final set of coefficients depends on the solver parameter. Each solver has a different way of converging to the final set of betas, but they usually converge to the same results.
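
For completeness, here is how the solver and tol parameters appear in scikit-learn, again with toy data made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[-2.0], [-0.5], [0.5], [2.0]])  # toy feature values
y = np.array([0, 0, 1, 1])                    # observed discrete labels

# solver picks the optimization algorithm; tol is the stopping threshold.
model = LogisticRegression(solver="lbfgs", tol=1e-4)
model.fit(X, y)

print(model.intercept_, model.coef_)  # the fitted betas
print(model.predict_proba(X)[:, 1])   # the learned S-curve values
```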

Arturo Sbr