
I am trying to understand the Gradient Descent Algorithm.

The code here should choose a more optimal line of best fit, given another line of best fit. The function takes the current line of best fit's slope and y-intercept as inputs, as well as a 2-D data set named "points" and a learningRate. This is the code I am working with:

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0                                                      # Initialize b_gradient to 0
    m_gradient = 0                                                      # Initialize m_gradient to 0
    N = float(len(points))                                              # Let N be the number of data points
    for i in range(0, len(points)):                                     # Iterate through the dataset "points"
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))      # partial derivative of the cost w.r.t. b at this point
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))  # partial derivative of the cost w.r.t. m at this point
    new_b = b_current - (learningRate * b_gradient)                     # step b against the gradient
    new_m = m_current - (learningRate * m_gradient)                     # step m against the gradient
    return [new_b, new_m]
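
For context, this is roughly how I am calling it (the toy dataset and starting values below are just made up for illustration, and assume "points" is a NumPy array):

import numpy as np

# Toy dataset: points lying roughly on the line y = 2x + 1 (made-up values)
points = np.array([[1.0, 3.1],
                   [2.0, 4.9],
                   [3.0, 7.2],
                   [4.0, 9.0]])

b, m = 0.0, 0.0                    # start from an arbitrary line
for _ in range(2000):              # repeat the gradient step many times
    b, m = step_gradient(b, m, points, learningRate=0.05)

print(b, m)                        # ends up near b ≈ 1.05, m ≈ 2.0 for this data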

However I do not understand what is happening inside the for loop.

I understand that the first two lines of the for loop will iteratively assign x and y to the next data point in the data set named "points".

I do not understand how b_gradient and m_gradient are being calculated.

To my understanding, b_gradient is the sum of all partial derivatives with respect to b, for every point in the data set. However, my real question is: how does the line

b_gradient += -(2/N) * (y - ((m_current * x) + b_current))

calculate the partial derivative with respect to b?

What is the -(2/N) for??

Can someone please explain how on earth this line of code represents the partial derivative with respect to b, of a point in this dataset?

Same confusion for m_gradient.

2 Answers


The b_gradient and m_gradient are the partial derivatives of the cost/error function with respect to b and m respectively. That's where the -(2/N) comes from: the 1/N is part of the cost/error function itself, and the factor of 2 appears when you differentiate the squared term.

If you don't know calculus, you will just have to take that on faith for now. If you do, then it is pretty easy to derive.
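
If it helps, here is a small sketch (it uses SymPy, which is not part of the code in the question; purely for checking) that differentiates the per-point squared-error term symbolically. The results match the two lines inside the for loop:

import sympy as sp

b, m, x, y, N = sp.symbols('b m x y N')

# Per-point contribution to the mean squared error cost
L = (1 / N) * (y - (m * x + b))**2

print(sp.simplify(sp.diff(L, b)))   # equals -(2/N) * (y - (m*x + b))
print(sp.simplify(sp.diff(L, m)))   # equals -(2/N) * x * (y - (m*x + b))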

Frank
  • Ohhhhhhhhh that makes so much sense. I do know calculus, and you're right, I just wasn't making the connection between this code and what the cost function looks like. – ParksideAdrian Jul 14 '18 at 03:41
  • Remember, gradient descent is only an algorithm to minimize the cost function. If you can see which direction would take your cost function the most "downhill" (the derivative), you can easily minimize it for something as simple as linear regression. – Frank Jul 14 '18 at 03:48

The contribution to the cost (loss) from each data point (xi, yi) in your system is:

Li = (1/N) * (yi - (m*xi + b))**2

The total cost is the sum of all the Li terms. You have N data points, and (1/N) is a normalizing term, so that the value of the cost stays consistent even if you change N.
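
Written out in the same notation, the total cost being minimized is

E = L1 + L2 + ... + LN
  = (1/N) * ((y1 - (m*x1 + b))**2 + ... + (yN - (m*xN + b))**2)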

Now partial differentiation of Li wrt m gives

Li_m = (1/N) * 2 * (yi - (m*xi + b)) * (-xi)
     = -(2/N) * xi * (yi - (m*xi + b))

And partial differentiation of Li wrt b gives

Li_b = (1/N) * 2 * (yi - (m*xi + b)) * (-1)
     = -(2/N) * (yi - (m*xi + b))
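
As a quick sanity check (a minimal sketch with a made-up dataset, not from the original post), you can compare these analytic gradients against a finite-difference approximation of the total cost:

import numpy as np

def cost(b, m, points):
    # Total cost E = (1/N) * sum((yi - (m*xi + b))**2)
    N = float(len(points))
    x, y = points[:, 0], points[:, 1]
    return np.sum((y - (m * x + b)) ** 2) / N

def analytic_gradients(b, m, points):
    # Sum of the per-point Li_b and Li_m derived above
    N = float(len(points))
    x, y = points[:, 0], points[:, 1]
    b_grad = np.sum(-(2 / N) * (y - (m * x + b)))
    m_grad = np.sum(-(2 / N) * x * (y - (m * x + b)))
    return b_grad, m_grad

points = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.5]])   # made-up data
b, m, eps = 0.5, 1.0, 1e-6

# Central finite differences of the cost, one parameter at a time
numeric_b = (cost(b + eps, m, points) - cost(b - eps, m, points)) / (2 * eps)
numeric_m = (cost(b, m + eps, points) - cost(b, m - eps, points)) / (2 * eps)

print(analytic_gradients(b, m, points))   # analytic gradients...
print(numeric_b, numeric_m)               # ...agree with the numerical estimates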
koshy george