Hello, I'm learning machine learning from first principles, so I coded up logistic regression with backprop from scratch using NumPy and calculus. Updating the weights with an exponentially weighted average of the gradient (momentum) works for me, but RMSProp and Adam do not: the cost doesn't drop. Am I doing something wrong?
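For context, the momentum-only update that works for me is just the exponentially weighted gradient applied directly, roughly like this (same variable names as in the full code below):

# momentum-only update: step along the smoothed gradient
VW = beta*VW + (1-beta)*dW
Vb = beta*Vb + (1-beta)*db
W -= learning_rate*VW
b -= learning_rate*Vb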
The main block for Adam is this:
# momentum
VW = beta*VW + (1-beta)*dW
Vb = beta*Vb + (1-beta)*db
# rmsprop
SW = beta2*SW + (1-beta2)*dW**2
Sb = beta2*Sb + (1-beta2)*db**2
# update weights (TODO: Adam doesn't work)
W -= learning_rate*VW/(np.sqrt(SW)+epsilon)
b -= learning_rate*Vb/(np.sqrt(Sb)+epsilon)
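Note that I'm leaving out Adam's bias correction; the textbook update would divide both moment estimates by a correction factor first, roughly like this (t is the 1-based iteration count, i.e. i + 1 in the loop, and the *_hat names are just for this sketch):

# bias-corrected Adam step (t = i + 1)
VW_hat = VW/(1 - beta**t)
Vb_hat = Vb/(1 - beta**t)
SW_hat = SW/(1 - beta2**t)
Sb_hat = Sb/(1 - beta2**t)
W -= learning_rate*VW_hat/(np.sqrt(SW_hat) + epsilon)
b -= learning_rate*Vb_hat/(np.sqrt(Sb_hat) + epsilon)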
The full code is below:
# load the breast cancer dataset
import numpy as np
import sklearn.datasets
import sklearn.metrics
X,y = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=False)
# standardize features: zero mean, unit variance per column
X = (X-np.mean(X,0))/np.std(X,0)
# reshape labels to a column vector (m, 1) to avoid a rank-1 array
y = y.reshape(len(y),1)
# dataset dimensions
m = X.shape[0]   # number of examples
n = X.shape[1]   # number of features
# hyperparameters
num_iter = 20000
learning_rate = 1e-6
beta = 0.9       # first-moment (momentum) decay rate
beta2 = 0.999    # second-moment (RMSProp) decay rate
epsilon = 1e-8   # numerical stability term
# init
np.random.seed(42)
W = np.random.randn(n,1)
b = np.random.randn(1)
VW = np.zeros((n,1))   # first moment (momentum) for W
Vb = np.zeros(1)       # first moment for b
SW = np.zeros((n,1))   # second moment (RMSProp) for W
Sb = np.zeros(1)       # second moment for b
for i in range(num_iter):
    # forward
    Z = X.dot(W) + b   # (m, 1) logits
    # sigmoid activation
    A = 1/(1+np.exp(-Z))
    # categorical cross-entropy (unused here)
    # cost = -np.sum(y*np.log(A))/m
    # binary classification cost
    j = (-y*np.log(A) - (1-y)*np.log(1-A)).sum()*(1/m)
    if i % 1000 == 999:
        print(i, j)
    # backward
    # derivative of the cost with respect to A (not used directly; dZ below folds in the sigmoid derivative)
    dA = (A-y)/(A*(1-A))
    dZ = A-y
    # note: dW and db are not divided by m, so they are m times the gradient of the averaged cost j
    dW = X.transpose().dot(dZ)
    db = dZ.sum()
    # momentum
    VW = beta*VW + (1-beta)*dW
    Vb = beta*Vb + (1-beta)*db
    # rmsprop
    SW = beta2*SW + (1-beta2)*dW**2
    Sb = beta2*Sb + (1-beta2)*db**2
    # update weights (TODO: Adam doesn't work)
    W -= learning_rate*VW/(np.sqrt(SW)+epsilon)
    b -= learning_rate*Vb/(np.sqrt(Sb)+epsilon)
# note: in load_breast_cancer, class 0 is malignant and class 1 is benign
print(sklearn.metrics.classification_report(y, np.round(A), target_names=['malignant', 'benign']))
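One way to sanity-check the backward pass is to compare dW against a finite-difference estimate of the summed cross-entropy (summed rather than averaged, since dW above is not divided by m). A rough sketch; summed_cost, eps, and k are just helper names for this check:

# finite-difference check of one component of dW
def summed_cost(W_, b_):
    A_ = 1/(1 + np.exp(-(X.dot(W_) + b_)))
    return (-y*np.log(A_) - (1-y)*np.log(1-A_)).sum()

eps = 1e-5
k = 0   # index of the weight to check
W_plus, W_minus = W.copy(), W.copy()
W_plus[k, 0] += eps
W_minus[k, 0] -= eps
numeric = (summed_cost(W_plus, b) - summed_cost(W_minus, b))/(2*eps)
analytic = X.transpose().dot(1/(1 + np.exp(-(X.dot(W) + b))) - y)[k, 0]
print(numeric, analytic)   # the two numbers should agree closely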
It turns out that for this particular problem, RMSProp/Adam takes much longer to converge compared to plain gradient descent; my implementation is correct.
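One detail that probably explains the difference (my reading, not something I tuned exhaustively): because Adam divides the gradient by its running RMS, each parameter moves by roughly learning_rate per step, so with learning_rate = 1e-6 the weights can only drift on the order of 20000 * 1e-6 = 0.02 over the whole run. The momentum-only update, in contrast, applies the raw un-averaged gradient and therefore can take much larger effective steps at the same learning rate. A quick way to see Adam make visible progress is to rerun the same loop with a larger step size, for example:

learning_rate = 1e-3   # a commonly used step size for Adam; just an illustration, not tuned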