
I am trying to implement simple gradient descent in Python using only NumPy, but something is missing and I cannot find it. I have done this before, but somehow I have been staring at this problem for the past day without being able to make it work.

import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

mnist = load_digits()
plt.imshow(mnist.images[0], cmap="gray")
plt.show()

def init_param():
    w1 = np.random.rand(10, 64)
    b1 = np.random.rand(10, 1)
    w2 = np.random.rand(10, 10)
    b2 = np.random.rand(10, 1)
    return w1, b1, w2, b2

def ReLU(z):
    return np.maximum(0, z)

def dReLU(z):
    return z > 0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dsigmoid(z):
    return sigmoid(z) * (1 - sigmoid(z))

def forward_prop(w1, b1, w2, b2, x):
    z1 = w1.dot(x).reshape(-1,1) + b1
    a1 = ReLU(z1)
    z2 = w2.dot(a1) + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def one_hot(y):
    one_hot_y = np.zeros((y.size, 10))
    one_hot_y[np.arange(y.size), y] = 1
    return one_hot_y.T

def back_prop(z1, a1, z2, a2, w2, x, y):
    one_hot_y = one_hot(y)
    
    dz2 = 2 * (a2 - one_hot_y) * dsigmoid(z2)
    dw2 = dz2.dot(a1.T) 
    db2 = np.sum(dz2, 1, keepdims=True)

    dz1 = w2.T.dot(dz2) * dReLU(z1)
    dw1 = dz1.dot(x.T)
    db1 = np.sum(dz1, 1, keepdims=True)

    return dw1, db1, dw2, db2

def update_param(w1, b1, w2, b2, dw1, db1, dw2, db2, lr):
    w1 = w1 - lr*dw1
    b1 = b1 - lr*db1
    w2 = w2 - lr*dw2
    b2 = b2 - lr*db2
    return w1, b1, w2, b2

def gradient_descent(X, Y, max_iter, lr):
    w1, b1, w2, b2 = init_param()
    m = len(X)
    for _ in range(max_iter):
        y_hat = []
        dw1, db1, dw2, db2 = np.zeros_like(w1), np.zeros_like(b1), np.zeros_like(w2), np.zeros_like(b2)
        for x, y in zip(X, Y):
            x = x.reshape(-1,1)
            z1, a1, z2, a2 = forward_prop(w1, b1, w2, b2, x)
            _dw1, _db1, _dw2, _db2 = back_prop(z1, a1, z2, a2, w2, x, y)
            dw1 += _dw1
            db1 += _db1
            dw2 += _dw2
            db2 += _db2
            y_hat.append(np.argmax(a2))
        w1, b1, w2, b2 = update_param(w1, b1, w2, b2, dw1, db1, dw2, db2, lr*(1/m))
        print(accuracy_score(Y, y_hat), end='\r')
    return w1, b1, w2, b2

mnist.data, mnist.target = shuffle(mnist.data, mnist.target)
data, labels = mnist.data, mnist.target
data = data / data.max()

Xtrain, Xvalid, Xtest = data[:1000], data[1000:2000], data[2000:]
Ytrain, Yvalid, Ytest = labels[:1000], labels[1000:2000], labels[2000:]

w1, b1, w2, b2 = gradient_descent(Xtrain, Ytrain, 100, 0.1)

The model is not training; it is stuck on a specific error value. I have checked the dimensions of the matrices and arrays and they are correct. The issue must be in the math of the model, but I cannot figure it out, unfortunately.

  • Please show us the error message. – David May 21 '23 at 01:16
  • There is no error message; it just doesn't train. The accuracy of the model is stuck at 0.092. That's why I suspect I did something wrong on the math side – ExpL0siV3Man79 May 21 '23 at 01:21
  • Oh, ok. I thought you mentioned a specific error value, but I realise you probably meant just that it was incorrect. – David May 21 '23 at 01:22
  • Can you add your `accuracy_score` function please? – David May 21 '23 at 01:24
  • It is the accuracy score from sklearn.metrics. I used it to make sure it wasn't my own function causing the problem – ExpL0siV3Man79 May 21 '23 at 01:25
  • This looks like a full-blown ML model on the MNIST dataset. Tagging it as simple gradient descent does not describe the real size of the code or task. – hpaulj May 21 '23 at 15:24
  • There is no other issue in the code except for the gradient descent part, which is why I chose this tag. I don't need help with loading the MNIST dataset, getting the metrics, or anything else. – ExpL0siV3Man79 May 21 '23 at 15:43
  • @hpaulj given that he is implementing it manually, describing it as gradient descent seems perfectly reasonable. Also, I'd hardly describe a 2 layer model as a large task. The focus is clearly on the underlying implementation of gradient descent. – David May 22 '23 at 13:45
  • @David, as long as he's got the attention of someone with knowledge and time help, the title and tags are fine. But it's stale enough that he shouldn't expect much new attention. – hpaulj May 22 '23 at 16:04
  • @hpaulj I think the problem is that the initial problem was indeed a gradient descent problem: He was using `tanh` instead of its derivative in his `back_prop` function. He then updated the question fixing a few of the suggestions which probably left the tags a little inadequate. I had already seen the original problem so it made sense to me. – David May 22 '23 at 17:07

1 Answer


I can see a few potential problems in your code:

  • You're using `tanh` in your `back_prop` function when you should be using its derivative, `dtanh`.
  • Your `one_hot` function doesn't match your `tanh` activation: `tanh` produces values between -1 and 1 (with a smooth transition in between), while `one_hot` produces 0s and 1s. You should use something like softmax instead of `tanh` as your output activation; a sketch follows this list. You could modify your `one_hot` function instead, but this isn't normally done.
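
Here's a minimal sketch of that swap, assuming the rest of the code in your question stays as posted. The `softmax` helper is my own illustration rather than something from your code, and the `(a2 - one_hot_y)` gradient in your `back_prop` would need to be rederived to match the new output activation:

def softmax(z):
    # Shift by the column-wise max for numerical stability, then
    # normalise so the 10 outputs sum to 1 and are directly comparable
    # to a one-hot target.
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

# In forward_prop, the output layer would then read:
#     z2 = w2.dot(a1) + b2
#     a2 = softmax(z2)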

Another potential issue is that your initial weights may be too large: `np.random.rand` draws from [0, 1), so a lot of your intermediate values end up with large absolute values. Try reducing the scale of your initial weights; there are also specific weight-initialisation methods in the literature that are worth a look. You may also be better off initialising your weights with a mean of zero, so your network has negative weights too.

Edit: I tested changing the initialisation method to a normal distribution around 0 with a scale of 0.1, and your model trains correctly: I got it to 97% training accuracy after 1000 epochs with a learning rate of 0.5. That initialisation looks like:

np.random.normal(0, 0.1, (10, 64))
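
For completeness, here's a sketch of `init_param` with that change applied to every parameter. The 0.1 scale is just the value from my test, not the only sensible choice; scaled schemes from the literature (for example dividing by the square root of the layer's input size) are a common alternative:

def init_param():
    # Zero-mean Gaussian parameters with a small scale keep the initial
    # pre-activations small, so the sigmoid and ReLU units don't start
    # out saturated.
    w1 = np.random.normal(0, 0.1, (10, 64))
    b1 = np.random.normal(0, 0.1, (10, 1))
    w2 = np.random.normal(0, 0.1, (10, 10))
    b2 = np.random.normal(0, 0.1, (10, 1))
    return w1, b1, w2, b2
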
– David
  • I will update the code above. I made some of the changes regarding the tanh. The `(a2 - one_hot_y)` term is the gradient of `(a2 - one_hot_y)**2` (I know I forgot to multiply by 2, but that shouldn't be the problem). Anyway, I will post the full code so hopefully someone can test it and detect the issue. – ExpL0siV3Man79 May 21 '23 at 12:01
  • @ExpL0siV3Man79 I think you're right about the loss function, I just didn't recognise it there. I've updated my answer with the actual implementation of my last suggestion which solves your issue. – David May 22 '23 at 16:54