Why does NN training loss flatten?

Question

I implemented a deep neural network from scratch, without using any Python frameworks such as TensorFlow or Keras.

The problem is that no matter what I change in my code (adjusting the learning rate, changing the number of layers or the number of nodes, or switching the activation function from sigmoid to ReLU to leaky ReLU), I end up with a training loss that starts at 6.98 and always converges to 3.24.

Why is that?

Please review my forward- and back-propagation code. Maybe there's something wrong in it that I couldn't identify.

My hidden layers use leaky ReLU and the final layer uses a sigmoid activation. I'm trying to classify the MNIST handwritten digits.

Code:

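# FORWARD PROPAGATION

# Hidden layers 1..layers-1 use leaky ReLU; cache["a0"] is the network input.
for i in range(layers - 1):
    z = np.dot(param["w" + str(i + 1)], cache["a" + str(i)]) + param["b" + str(i + 1)]
    cache["a" + str(i + 1)] = lrelu(z)

# The output layer uses sigmoid.
z = np.dot(param["w" + str(layers)], cache["a" + str(layers - 1)]) + param["b" + str(layers)]
cache["a" + str(layers)] = sigmoid(z)

yn = cache["a" + str(layers)]
m = X.shape[1]  # number of training examples
# Cross-entropy summed over the output units and averaged over the examples.
cost = -np.sum(y * np.log(yn) + (1 - y) * np.log(1 - yn)) / m

if j % 10 == 0:
    print(cost)
    costs.append(cost)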

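# BACKPROPAGATION

# Sigmoid output with cross-entropy loss: dz at the last layer is yn - y.
grad = {"dz" + str(layers): yn - y}

# Walk backwards from the last layer (k = layers) to the first (k = 1).
for i in range(layers):
    k = layers - i
    grad["dw" + str(k)] = np.dot(grad["dz" + str(k)], cache["a" + str(k - 1)].T) / m
    grad["db" + str(k)] = np.sum(grad["dz" + str(k)], 1, keepdims=True) / m

    if i < layers - 1:
        # Propagate the error to the previous layer through the leaky ReLU derivative.
        grad["dz" + str(k - 1)] = np.dot(param["w" + str(k)].T, grad["dz" + str(k)]) * lreluDer(cache["a" + str(k - 1)])

# Gradient descent update with learning rate alpha.
for i in range(layers):
    param["w" + str(i + 1)] = param["w" + str(i + 1)] - alpha * grad["dw" + str(i + 1)]
    param["b" + str(i + 1)] = param["b" + str(i + 1)] - alpha * grad["db" + str(i + 1)]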

Are these two numbers, 6.98 and 3.24, exactly the same every time? — devspartan, Jul 03 '20 at 12:42
Yeah, it stays at 3.24 until around 10000 iterations and then it overshoots, I guess (I get some error about division by zero). — Lelouche Lamperouge, Jul 03 '20 at 15:01

score 0 · Answer 1 · answered Jul 06 '20 at 08:58

The implementation seems okay. You could plausibly converge to the same value with different models, learning rates, or other hyperparameters; what's worrying is getting the same starting value every time (6.98 in your case).
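
As a rough sanity check on those two numbers, here is a minimal sketch of two reference values for this cost. It assumes 10 output units, one-hot labels, and roughly balanced classes, none of which the question states explicitly; the variable names are made up for illustration.

```python
import numpy as np

K = 10  # assumed number of output units (digits 0-9)

# Case 1: every sigmoid output equals 0.5 (e.g. all pre-activations are zero).
# Each of the K units then contributes -log(0.5), whatever the label is.
loss_at_half = K * -np.log(0.5)

# Case 2: every output is stuck at the class frequency 1/K (balanced classes assumed).
# Per example, one unit has target 1 and the other K-1 units have target 0.
p = 1.0 / K
loss_at_prior = -(np.log(p) + (K - 1) * np.log(1 - p))

print(round(loss_at_half, 2), round(loss_at_prior, 2))
```

This gives roughly 6.93 and 3.25, close to (though not exactly) the reported 6.98 and 3.24. That would be consistent with a network whose outputs start near 0.5 everywhere and end up predicting only the class frequencies, ignoring the input.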

I suspect it has to do with your initialisation. If you're setting all your weights to zero initially, you're not going to break symmetry. That is explained here and here in adequate detail.
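
To illustrate, here is a minimal sketch of one gradient computation on a tiny two-layer network, using the same formulas as the back-propagation code in the question. The helper definitions (lrelu, lrelu_der, sigmoid), the layer sizes, and the random data are assumptions for illustration, since the question does not show them.

```python
import numpy as np

# Hypothetical helpers; the question's own lrelu/lreluDer/sigmoid are not shown.
def lrelu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def lrelu_der(a, slope=0.01):
    return np.where(a > 0, 1.0, slope)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(w1, b1, w2, b2, X, y):
    """Forward and backward pass for one hidden layer, as in the question."""
    m = X.shape[1]
    a1 = lrelu(w1 @ X + b1)
    yn = sigmoid(w2 @ a1 + b2)
    dz2 = yn - y
    dw2 = dz2 @ a1.T / m
    db2 = dz2.sum(axis=1, keepdims=True) / m
    dz1 = (w2.T @ dz2) * lrelu_der(a1)
    dw1 = dz1 @ X.T / m
    return dw1, dw2, db2

rng = np.random.default_rng(0)
X = rng.random((4, 8))                    # toy data: 4 features, 8 examples
y = np.eye(3)[:, rng.integers(0, 3, 8)]   # toy one-hot labels, 3 classes

# All-zero initialisation: 5 hidden units, 3 outputs.
zeros = (np.zeros((5, 4)), np.zeros((5, 1)), np.zeros((3, 5)), np.zeros((3, 1)))
dw1_zero, dw2_zero, db2_zero = gradients(*zeros, X, y)

# Small random weights, zero biases (one common choice, assumed here).
rand = (0.01 * rng.standard_normal((5, 4)), np.zeros((5, 1)),
        0.01 * rng.standard_normal((3, 5)), np.zeros((3, 1)))
dw1_rand, dw2_rand, db2_rand = gradients(*rand, X, y)

print("zero init:  ", np.abs(dw1_zero).max(), np.abs(dw2_zero).max(), np.abs(db2_zero).max())
print("random init:", np.abs(dw1_rand).max(), np.abs(dw2_rand).max(), np.abs(db2_rand).max())
```

With all-zero parameters, only the output bias gets a non-zero gradient in this sketch: the hidden activations are zero, so dw2 is zero, and w2 is zero, so nothing flows back to dw1. The weights never move, and the network can only learn a constant output. With small random weights, both weight matrices receive non-zero gradients.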
