
This is my first attempt at coding a multilayer neural network in Python (the code is attached below). I'm having a hard time getting the gradient descent partial derivatives right: it seems the weights are not being updated properly. When I try to predict the output of a new sample, I always get the wrong answer. There should be two output values with a probability attached to each; for example, if a new sample belongs to class 1, its probability should be above 0.5 (prob_class1), and class 2 should then get (1 - prob_class1). Instead, the code just yields [1,1] or [-1,-1] for any sample. I've double-checked all the lines, and I'm almost sure the problem is in how I'm applying gradient descent. Could anyone help me, please? Thank you in advance.

import numpy as np
import sklearn 
from sklearn.linear_model import LogisticRegressionCV
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

np.random.seed(0)
x, y = sklearn.datasets.make_moons(200, noise=0.20)
plt.scatter(x[:,0], x[:,1], s=40, c=y, cmap=plt.cm.Spectral)
y = y.reshape(-1,1)
N = x.shape[0]

n_input = min(x.shape)
n_output = 2
n_hidden = max(n_input,n_output) + 20 # 20 is arbitrary
n_it = 10000 
alpha = 0.01

def predict(model,xn):
    W1, b1, W2, b2, W3, b3 = model['W1'], model['b1'], model['W2'], model['b2'],model['W3'], model['b3']
    z1 = W1.dot(xn) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    a2 = np.tanh(z2)
    z3 = a2.dot(W3) + b3
    a3 = np.tanh(z3)

    return a3

model = {}

W1 = np.random.randn(n_input,n_input)
b1 = np.random.randn(1,n_input)
W2 = np.random.randn(n_input,n_hidden)
b2 = np.random.randn(1,n_hidden)
W3 = np.random.randn(n_hidden,n_output)
b3 = np.random.randn(1,n_output)

for i in range(n_it):

    # Feedforward:
    z1 = x.dot(W1) + b1
    a1 = np.tanh(z1)
    z2 = a1.dot(W2) + b2
    a2 = np.tanh(z2)
    z3 = a2.dot(W3) + b3
    a3 = np.tanh(z3)


    # Loss function:
    # f(w,b) = (y - (w*x + b))^2
    # df/dw = -2*(1/N)*x*(y - (w*x + b))
    # df/db = -2*(1/N)*(y - (w*x + b))

    # Backpropagation:
    dW3 = -2*(1/N)*(a2.T).dot(y-a3)
    db3 = -2*(1/N)*sum(y-a3)
    db3 = db3.reshape(-1,1)
    db3 = db3.T
    dW2 = -2*(1/N)*a1.T.dot(a2)
    db2 = -2*(1/N)*sum(a2)
    db2 = db2.reshape(-1,1)
    db2 = db2.T
    dW1 = -2*(1/N)*(x.T).dot(a1)
    db1 = -2*(1/N)*sum(dW1)
    db1 = db1.reshape(-1,1)
    db1 = db1.T

    # Updating weights
    W3 += alpha*dW3
    b3 += alpha*db3
    W2 += alpha*dW2
    b2 += alpha*db2
    W1 += alpha*dW1
    b1 += alpha*db1

model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2, 'W3':W3, 'b3':b3}
test = np.array([2,0])
prediction = predict(model,test)

1 Answer


A couple of things come to mind when looking at your code:

First, you are not using the chain rule to compute the backpropagation. For an intuitive understanding, you can watch this great class by Andrej Karpathy (https://www.youtube.com/watch?v=i94OvYb6noo), but there are also plenty of other resources online. You might start with one hidden layer (you have two here), as it makes things much easier. A sketch of what this looks like is below.
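For illustration, here is a minimal sketch of one chain-rule backpropagation step for a single-hidden-layer network with tanh activations and a mean-squared-error loss, matching the setup in your question. The function name `backprop_step` and the exact shapes are my own assumptions, not a drop-in fix for your code:

    import numpy as np

    # Minimal sketch: one hidden layer, tanh activations, MSE loss.
    # x: (N, n_in), y: (N, n_out), W1: (n_in, n_hid), W2: (n_hid, n_out)
    def backprop_step(x, y, W1, b1, W2, b2, alpha=0.01):
        N = x.shape[0]

        # Forward pass
        z1 = x.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        a2 = np.tanh(z2)

        # Backward pass (chain rule):
        # dL/da2 -> dL/dz2 -> dW2, db2 -> dL/da1 -> dL/dz1 -> dW1, db1
        dL_da2 = -2.0 / N * (y - a2)        # derivative of MSE w.r.t. the output
        dL_dz2 = dL_da2 * (1 - a2 ** 2)     # tanh'(z2) = 1 - tanh(z2)^2
        dW2 = a1.T.dot(dL_dz2)
        db2 = dL_dz2.sum(axis=0, keepdims=True)

        dL_da1 = dL_dz2.dot(W2.T)           # propagate the error through W2
        dL_dz1 = dL_da1 * (1 - a1 ** 2)     # tanh'(z1)
        dW1 = x.T.dot(dL_dz1)
        db1 = dL_dz1.sum(axis=0, keepdims=True)

        # Gradient descent update: move against the gradient
        W1 -= alpha * dW1
        b1 -= alpha * db1
        W2 -= alpha * dW2
        b2 -= alpha * db2
        return W1, b1, W2, b2

Notice how the gradient of each earlier layer reuses the gradient already computed for the layer after it; that reuse is the chain rule, and it is what your current dW1/dW2 expressions are missing.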

Second, you should also use the derivative of tanh in the backward pass: you apply tanh in the forward propagation, so its derivative has to appear when you propagate the gradients back the other way.
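Concretely, since tanh'(z) = 1 - tanh(z)^2, every backward step through a tanh layer should multiply the incoming gradient by that factor. A quick finite-difference check (illustrative only, the helper `tanh_prime` is just a name I made up) confirms the formula:

    import numpy as np

    def tanh_prime(z):
        # Analytic derivative: d/dz tanh(z) = 1 - tanh(z)^2
        return 1.0 - np.tanh(z) ** 2

    # Compare against a numerical (finite-difference) derivative
    z = np.linspace(-3, 3, 7)
    eps = 1e-6
    numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
    print(np.allclose(tanh_prime(z), numeric))  # True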

Finally, why do you have two output nodes? It seems to me that output_1 = 1 - output_2 in this case. Or, if you want the two outputs to be computed separately, you would need to normalize them at the end to get the probability of belonging to class 1 or class 2.
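For example, if you keep two output nodes, one standard way to normalize them (this is my suggestion, not something your code already does) is a softmax over the final pre-activations, which gives two non-negative values that sum to 1 and can be read as class probabilities:

    import numpy as np

    def softmax(z):
        # Subtract the row max for numerical stability, then normalize each row
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    z3 = np.array([[0.2, -1.3]])       # example final-layer pre-activations
    probs = softmax(z3)
    print(probs, probs.sum())          # the two entries sum to 1

The simpler alternative is a single sigmoid output interpreted as prob_class1, with prob_class2 = 1 - prob_class1.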

MaximeKan