
I have been trying to program a gradient descent algorithm in R for logistic regression in order to understand it better. In Andrew Ng's ML course they seem to skip this part and instead show the advanced optimization approach. However, I would like to recreate the gradient descent approach myself. Here is my attempt:

###my data

X <- c(34.62366, 30.28671, 35.84741, 60.18260, 79.03274)
X <- cbind(rep(1,5),X)
y <- c(0, 0, 0, 1, 1)

###sigmoid function to calculate predicted probabilities

sigmoid <- function(z) {
  #SIGMOID Compute the sigmoid function element-wise
  z <- as.matrix(z)
  g <- 1 / (1 + exp(-z))
  g
}
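
A quick sanity check of the sigmoid (the test values here are just illustrative):

###sanity check: sigmoid(0) is exactly 0.5, large negative/positive inputs approach 0 and 1
sigmoid(c(-10, 0, 10))
# returns roughly 4.5e-05, 0.5 and 0.99995 (as a one-column matrix, because of as.matrix)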


###Gradient Descent 

theta <-  c(0,0)
iterations <- 15000
alpha <- 0.02
m <- length(y)

for (i in 1:iterations) {

  theta_prev <- theta

  p <- dim(X)[2]

  for (j in 1:p) {
    h <- sigmoid(X %*% theta_prev)

    #gradient of the cost with respect to theta[j]
    deriv <- (t(h - y) %*% X[,j]) / m

    theta[j] <- theta_prev[j] - (alpha * deriv)
  }
}
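
For reference, since theta_prev is held fixed inside the inner loop, the j-loop performs one simultaneous update per iteration; the same step can be written in vectorized form (a sketch that re-initialises theta and reuses the X, y, sigmoid, iterations, alpha, and m defined above, and should follow the same trajectory):

###vectorized form of the update (equivalent to the j-loop above)
theta <- c(0, 0)
for (i in 1:iterations) {
  h <- sigmoid(X %*% theta)                 # predicted probabilities, one per observation
  grad <- t(X) %*% (h - y) / m              # gradient of the log-loss, one entry per coefficient
  theta <- theta - alpha * as.vector(grad)  # simultaneous update of all coefficients
}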

Running the loop above gives me final coefficients of -11.95 and 0.24, whereas using R's glm function I get -90.87 and 1.89. Does anyone know where my code goes wrong?

Here's the code for the GLM model:

X <- X[,2]
mod <- glm(y ~ X, family = 'binomial')
coef(mod)

Thanks in advance!

EDIT: With this larger dataset, which doesn't have perfect separation, the discrepancy between the coefficients remains. The discrepancy also remains with even larger datasets of 100 observations.

X <- c(34.62366, 30.28671, 35.84741, 60.18260, 79.03274, 45.08328, 61.10666,
   75.02475, 76.09879, 84.43282, 95.86156, 75.01366, 82.30705, 69.36459, 39.53834)
X <- cbind(rep(1,15),X)
y <- c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0)

With this slightly larger dataset my attempt returns coefficients of -18.46 and 0.15, whilst R's glm returns -4.12 and 0.07.

pd441

1 Answer


The problem you are seeing is due to your data: you have data that can be separated by several planes. Check this discussion: http://r.789695.n4.nabble.com/glm-fit-quot-fitted-probabilities-numerically-0-or-1-occurred-quot-td849242.html

Please note that when I try glm() I get a warning

glm.fit: glm.fit: "fitted probabilities numerically 0 or 1 occurred"

This should give you a hint that something is not correct. Basically you will find out that there are infinitely many planes that can separate your points (all the 0s are on the left and all the 1s on the right of the axis, so to speak). This is explained well in the discussion linked above. Your self-developed GD returns different values depending on your starting values (try it!) since several of them are fine... Starting with

theta <-  c(20,20)

Will give you

> theta
[1] -18.6533438   0.3883605
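
You can verify the separation directly on the original five observations (a quick check, assuming X is still the 5 x 2 matrix and y the 0/1 vector from the top of the question):

###every x value in class 0 is smaller than every x value in class 1
max(X[y == 0, 2]) < min(X[y == 1, 2])
# TRUE: any threshold between 35.85 and 60.18 separates the two classes perfectly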

In the plot below you see three lines I got from your method with different starting conditions, and as you can see they all separate your points very well...

[Plot: three separating lines obtained from gradient descent with different starting values]

I hope it helps. Best, Umberto

EDIT: Having looked at your new data I would say that they are not linearly separable (the opposite of what your initial data suggested). The model given by glm is not really working. Check with summary(mod):

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -4.11494    2.32945  -1.766   0.0773 .
X[, 2]       0.06759    0.03527   1.916   0.0553 .

Check the errors and the z values... So personally I would not give much weight to the results you are getting from glm... And your code gives results that depend (as expected) on the initial values... By the way, to get a stable result with your code and hyperparameters you need more iterations... I am still checking a bit and will update the answer as soon as I find more.

EDIT 2: I got something. If you use the following parameters

theta <-  c(-4,0.05)
iterations <- 1000000
alpha <- 0.001

from your method you get

> theta
[1] -4.11500250  0.06758884

and from glm you get

> coef(mod)
(Intercept)      X[, 2] 
-4.11493568  0.06758787 

So the same values (well, very, very close to each other). Now note that if you use the initial parameters c(0,0) you still get the same result... So it is a matter of learning rate (if you choose it too big, your parameters do not converge). I checked the behaviour of the values of theta and saw that the parameters were oscillating between two values, a clear sign that the learning rate was too big. In addition, you needed a lot more iterations to converge...

The plot below shows the behaviour of the intercept vs. the number of iterations, to give you an idea...

[Plot: intercept value vs. number of iterations]
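
If you want to reproduce such a trace yourself, here is a minimal sketch (it reuses X, y, and sigmoid from the question, with the vectorized form of your update; the hyperparameters here are just illustrative):

###record the intercept after every iteration to see oscillation vs. convergence
theta <- c(0, 0)
alpha <- 0.001
iterations <- 200000
m <- length(y)
intercept_trace <- numeric(iterations)

for (i in 1:iterations) {
  h <- sigmoid(X %*% theta)
  theta <- theta - alpha * as.vector(t(X) %*% (h - y) / m)
  intercept_trace[i] <- theta[1]
}

plot(intercept_trace, type = "l", xlab = "iteration", ylab = "intercept (theta[1])")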

Hope that helps, Umberto

  • Thank you for your reply! However, I still get differing coefficients when I use a larger data set which isn't perfectly separated... – pd441 Mar 29 '18 at 09:30
  • Can you post the data somewhere? Is it a big data file? – Umberto Mar 29 '18 at 09:37
  • Yeah, I just updated my post with a larger dataset. Your response makes perfect sense to me; however, I would imagine that when perfect separation is not an issue, the coefficients should be the same. FYI, I also get differences in coefficients with much larger datasets, but I just can't upload such a large dataset with the question here. – pd441 Mar 29 '18 at 09:41
  • The example with 15 observations might be enough, but if you want the csv I can post it; I just don't know where or how to do that. Stack Overflow doesn't allow one to upload data. – pd441 Mar 29 '18 at 09:42
  • Thanks for the answer! So the issue is just that the number of iterations was too low and the alpha was too high(?) Of course you set the initial thetas to -4 and 0.05 in light of what we already know from R's `glm` results. Does this mean that, given a much larger amount of time, we could arrive at the correct values of theta starting from 0, 0? It's just that it would really take ages! Second thing: is this probably the reason why, in Andrew Ng's ML course, they go straight to the advanced optimization methods, because the gradient descent approach takes too long? – pd441 Mar 29 '18 at 10:53
  • As I mentioned, I tried starting from 0,0 and you still get the right results. The problem with plain GD is not that it takes too long (that is also a problem) but that, as you have seen, it is not easy to get it to converge... But a complete discussion would take too much space here... – Umberto Mar 29 '18 at 12:06
  • Ah of course, I missed that you said that; the new solution works well! – pd441 Mar 29 '18 at 12:30