I am training MNIST on an 8-layer (1568-784-512-256-128-64-32-10) fully-connected deep neural network with a newly created activation function, shown in the figure below. This function looks somewhat similar to ReLU; however, it has a slight curve at the "kink".
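Since the exact formula is only given in the figure, the following is just a rough stand-in (a softplus-style smoothing) for what I mean by "ReLU with a curve at the kink"; it is not my actual function:

    import torch
    import torch.nn.functional as F

    def smooth_relu(x, beta=1.0):
        # Stand-in only: softplus behaves like ReLU away from zero
        # but is smoothly curved around the kink at x = 0, which is
        # the qualitative shape described above.
        return F.softplus(x, beta=beta)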
It worked fine when I used it to train 5-, 6-, and 7-layer fully-connected neural networks. The problem arises with the 8-layer fully-connected network: it only learns during the first few epochs and then stops learning (the test loss becomes "nan" and the test accuracy drops to 9.8%). Why does this happen?
My other configurations are as follows: dropout = 0.5, weight initialization = Xavier initialization, learning rate = 0.1.
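For concreteness, here is a minimal sketch of roughly how the network is set up (PyTorch used only for illustration; nn.Softplus stands in for my activation, and plain SGD is an assumption about the optimizer):

    import torch
    import torch.nn as nn

    sizes = [1568, 784, 512, 256, 128, 64, 32, 10]  # layer widths from above

    layers = []
    for i in range(len(sizes) - 1):
        linear = nn.Linear(sizes[i], sizes[i + 1])
        nn.init.xavier_uniform_(linear.weight)      # Xavier initialization
        nn.init.zeros_(linear.bias)
        layers.append(linear)
        if i < len(sizes) - 2:                      # hidden layers only
            layers.append(nn.Softplus())            # stand-in for my activation
            layers.append(nn.Dropout(p=0.5))        # dropout = 0.5

    model = nn.Sequential(*layers)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate = 0.1
    criterion = nn.CrossEntropyLoss()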