
Hello, I'm learning machine learning from first principles, so I coded up logistic regression with backpropagation from scratch using NumPy and calculus. Updating the gradient with a weighted average (momentum) works for me, but RMSProp and Adam don't: the cost doesn't drop. Am I doing something wrong?

The main block for Adam is this:

# momentum
VW = beta*VW + (1-beta)*dW
Vb = beta*Vb + (1-beta)*db
# rmsprop
SW = beta2*SW + (1-beta2)*dW**2
Sb = beta2*Sb + (1-beta2)*db**2
# update weights  TODO: Adam doesn't work
W -= learning_rate*VW/(np.sqrt(SW)+epsilon)
b -= learning_rate*Vb/(np.sqrt(Sb)+epsilon)
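
For reference, my understanding is that the full Adam update also bias-corrects V and S before taking the step, which matters early on since both start at zero. A minimal sketch of that version, where t is a 1-based iteration counter and the *_hat names are just my own labels, not variables from the code above:

# running averages, same as above
VW = beta*VW + (1-beta)*dW
Vb = beta*Vb + (1-beta)*db
SW = beta2*SW + (1-beta2)*dW**2
Sb = beta2*Sb + (1-beta2)*db**2
# bias correction into new variables so the running averages stay intact
VW_hat = VW/(1 - beta**t)
Vb_hat = Vb/(1 - beta**t)
SW_hat = SW/(1 - beta2**t)
Sb_hat = Sb/(1 - beta2**t)
# parameter step uses the corrected estimates
W -= learning_rate*VW_hat/(np.sqrt(SW_hat) + epsilon)
b -= learning_rate*Vb_hat/(np.sqrt(Sb_hat) + epsilon)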

The full code looks like this:

# load the breast cancer dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True, as_frame=False)
# scaling input
X = (X-np.mean(X,0))/np.std(X,0)
# avoid rank 1 vector
y = y.reshape(len(y),1)

# dataset shape: m examples, n features
m = X.shape[0]
n = X.shape[1]

# hyperparameters
num_iter = 20000
learning_rate = 1e-6 
beta = 0.9
beta2 = 0.999
epsilon = 1e-8

# init
np.random.seed(42)
W = np.random.randn(n,1)
b = np.random.randn(1)
VW = np.zeros((n,1))
Vb = np.zeros(1)
SW = np.zeros((n,1))
Sb = np.zeros(1)


for i in range(num_iter):
    # forward

    Z = X.dot(W) + b  # shape (m, 1)

    # sigmoid
    A = 1/(1+np.exp(-Z))
    # categorical cross-entropy
    # cost = -np.sum(y*np.log(A))/m

    # binary cross-entropy cost
    j = (-y*np.log(A) - (1-y)*np.log(1-A)).sum()/m
    
    if i % 1000 == 999:
        print(i, j)
    
    # backward

    # derivative of the cost with respect to A (not actually used; dZ is computed directly)
    dA = (A-y)/(A*(1-A))
    # derivative with respect to Z: dA * sigmoid'(Z) simplifies to A - y
    dZ = A-y
    
    # gradients (note: no 1/m factor, so these are m times the gradient of j)
    dW = X.transpose().dot(dZ)
    db = dZ.sum()
    # momentum
    VW = beta*VW + (1-beta)*dW
    Vb = beta*Vb + (1-beta)*db
    # rmsprop
    SW = beta2*SW + (1-beta2)*dW**2
    Sb = beta2*Sb + (1-beta2)*db**2
    # update weights  TODO: Adam doesn't work
    W -= learning_rate*VW/(np.sqrt(SW)+epsilon)
    b -= learning_rate*Vb/(np.sqrt(Sb)+epsilon)

# in this dataset target 0 is malignant and 1 is benign
print(classification_report(y, np.round(A), target_names=['malignant', 'benign']))

It turns out that for this particular problem, RMSProp/Adam take much longer to converge than plain gradient descent at the same learning rate; my implementation is correct.
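
A rough way to see why (my own back-of-the-envelope check, not a precise analysis): once SW has warmed up, VW/np.sqrt(SW) is roughly of order 1 per parameter, so Adam's step size is about learning_rate = 1e-6 regardless of how large the gradient is, and over 20000 iterations each weight can only move on the order of 0.02. Plain gradient descent's step is learning_rate*dW, and dW here is large because it sums over all m = 569 examples without a 1/m factor, so it moves much faster at the same learning rate. Dropping something like this inside the training loop (after dW, VW and SW are updated) makes the difference visible:

# illustrative comparison of per-step movement; both arrays have shape (n, 1)
adam_step = learning_rate*VW/(np.sqrt(SW) + epsilon)  # roughly O(learning_rate) per entry
gd_step = learning_rate*dW                            # scales with the raw gradient
if i % 1000 == 999:
    print(i, np.abs(adam_step).max(), np.abs(gd_step).max())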

superbik
  • It looks like you are not overwriting the momentum and RMSProp terms correctly in each iteration. Try writing ```VW[:] = ...``` instead of ```VW = ...```. – Kevin Apr 13 '21 at 19:44
  • Since these parameters depend on their values from the previous iteration, you are only taking the initialized value into account. Something like ```*=```, on the other hand, will do the operation in place. – Kevin Apr 13 '21 at 19:48
  • Hello Kevin, thanks for your comments. – superbik Apr 14 '21 at 03:16
  • NumPy arrays don't need slicing here; if we use an operator with a scalar, it gets broadcast to the whole array (https://numpy.org/doc/stable/user/basics.broadcasting.html). The initial values for V and S are zeros, so they take a while to update. However, even with bias correction (```SW /= 1-beta2**i```), RMSProp just doesn't work for me. – superbik Apr 14 '21 at 03:26
  • You say that the cost doesn't drop. Assuming you mean `j`, it appears that it *is* dropping. It is just very gradual due to the learning rate of 10^-6. Increasing either the learning rate or the number of iterations by a couple of orders of magnitude should help you reach a much lower cost. – whydoubt Apr 14 '21 at 05:15
  • Thanks, you're right. It turns out that for this particular problem, RMSProp/Adam take much longer to converge than plain gradient descent or gradient descent with momentum at the same learning rate. At 10^-6 it appears to stand still, while gradient descent has already converged. – superbik Apr 15 '21 at 02:55

0 Answers