
I am training MNIST on an 8-layer (1568-784-512-256-128-64-32-10) fully-connected deep neural network with a newly created activation function, shown in the figure below. This function looks a bit similar to ReLU, but it has a little curve at the "kink".

It worked fine when I used it to train 5-, 6-, and 7-layer fully-connected neural networks. The problem arises when I use it in the 8-layer fully-connected network: it only learns during the first few epochs and then stops learning (the test loss becomes "nan" and the test accuracy drops to 9.8%). Why does this happen?

My other configurations are as follows: dropout = 0.5, weight initialization = Xavier initialization, learning rate = 0.1.
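For concreteness, here is a minimal PyTorch sketch of the setup described above; since the exact custom activation is not given, `F.softplus` stands in as a hypothetical placeholder for a ReLU-like function with a smooth kink, while the layer sizes, dropout, Xavier initialization, and learning rate follow the question:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepFC(nn.Module):
    """8-layer fully-connected net, 1568-784-512-256-128-64-32-10 (as in the question)."""
    def __init__(self, sizes=(1568, 784, 512, 256, 128, 64, 32, 10)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(n_in, n_out) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        self.dropout = nn.Dropout(p=0.5)               # dropout = 0.5
        for layer in self.layers:
            nn.init.xavier_uniform_(layer.weight)      # Xavier initialization
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers[:-1]:
            # softplus is only a stand-in for the custom smooth-kink activation
            x = self.dropout(F.softplus(layer(x)))
        return self.layers[-1](x)                      # raw logits for the 10 classes

model = DeepFC()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate = 0.1
```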

[Figure: plot of the custom activation function, similar to ReLU but with a smooth curve at the kink]


Joshua

1 Answer


I believe this is the vanishing gradient problem, which usually occurs in deep networks. There is no hard and fast rule for solving it; my advice would be to rework your network architecture.

See here: Avoiding vanishing gradient in deep neural networks
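One way to check for this (a minimal sketch, assuming a PyTorch training loop; the `report_grad_norms` helper and the stand-in model below are hypothetical) is to log the gradient norm of each layer right after `backward()`. Norms that shrink toward zero in the early layers point to vanishing gradients, while norms that blow up to very large values or `nan` point to exploding gradients.

```python
import torch
import torch.nn as nn

def report_grad_norms(model):
    # Print the gradient norm of every parameter after backward().
    # Vanishing: early-layer norms are orders of magnitude smaller than later-layer norms.
    # Exploding: norms grow very large or become inf/nan.
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name:25s} grad norm = {param.grad.norm().item():.3e}")

# Usage with a small stand-in model and random data:
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
loss = nn.CrossEntropyLoss()(model(torch.randn(32, 784)), torch.randint(0, 10, (32,)))
loss.backward()
report_grad_norms(model)
```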

Akshay Bahadur
  • Hi, how can I know whether it suffers from vanishing or exploding gradients? Is there any way to identify it? – Joshua Apr 25 '18 at 07:35
  • Yes. If your loss is going to nan, that means your gradients are vanishing. Try decreasing the learning rate first; make it very small. Try going for learning rate decay as well (see the sketch after this comment thread). If nothing above works, go for a different architecture. – Akshay Bahadur Apr 25 '18 at 12:19
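A minimal sketch of the two suggestions in that comment (a much smaller learning rate plus learning rate decay), assuming PyTorch's `SGD` and `StepLR` scheduler; the stand-in model and the concrete step size and decay factor are placeholder choices, not values from the thread:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))  # stand-in model

# Start from a much smaller learning rate than the 0.1 used in the question ...
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# ... and decay it further during training: multiply the lr by 0.5 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... forward pass, loss.backward(), optimizer.step() for each batch go here ...
    scheduler.step()  # step the scheduler once per epoch to decay the learning rate
    print(f"epoch {epoch:2d}: lr = {scheduler.get_last_lr()[0]:.5f}")
```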