
I am using PyTorch and autograd to build my neural network architecture. It is a small 3-layer network with a single input and output. Suppose I have to predict some output function based on some initial conditions, and I am using a custom loss function.

The problem I am facing is:

  1. My loss converges initially but gradients vanish eventually.

  2. I have tried the sigmoid and tanh activations; tanh gives slightly better results in terms of loss convergence.

  3. I tried using ReLU, but since my network has very few weights, the neurons die and it doesn't give good results.

Is there any other activation function, apart from sigmoid and tanh, that handles the problem of vanishing gradients well enough for small-sized neural networks? Any suggestions on what else I can try?
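
For reference, here is a minimal sketch of the kind of setup described above. The hidden width (20), the tanh activation, the stand-in target function, and the plain MSE loss are assumptions for illustration, not the asker's actual code:

```python
# Minimal sketch (assumed shapes and loss): a 1 -> 20 -> 1 network trained
# with a placeholder MSE loss against a stand-in target function.
import math
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1, 20),
    nn.Tanh(),
    nn.Linear(20, 1),
)

x = torch.linspace(0.0, 1.0, 100).unsqueeze(1)    # sample inputs
target = torch.sin(2 * math.pi * x)               # stand-in target function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    optimizer.zero_grad()
    loss = ((model(x) - target) ** 2).mean()      # placeholder for the custom loss
    loss.backward()
    optimizer.step()
```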

The Singularity
  • Can you elaborate on "My loss converges initially but gradients vanish eventually"? If the loss has converged, then the gradient should be close to zero. Then the small gradient is nothing bad, rather a good sign that your model is already optimal. – Kota Mori Sep 19 '21 at 12:43
  • Yeah, but as far as I can see it is not reaching the global minimum, which I really want; the gradients of the weights in the first layer become zero before that. – unstableEquilibrium Sep 19 '21 at 13:00

1 Answer


In the deep learning world, ReLU is usually preferred over other activation functions because it mitigates the vanishing gradient problem, allowing models to learn faster and perform better. But it has its own downsides.

Dying ReLU problem

The dying ReLU problem refers to the scenario where a large number of ReLU neurons only output zero. Once these neurons output zero, gradients fail to flow through them during backpropagation and their weights do not get updated. Ultimately a large part of the network becomes inactive and is unable to learn further.
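
A tiny, self-contained illustration of a dead unit (the numbers are made up, not taken from the question's code): once the pre-activation is negative for every input, the output is identically zero and no gradient reaches the weights.

```python
# A single ReLU unit with a large negative bias: the pre-activation w*x + b is
# negative for every input in [0, 1), so the output is 0 and the weight gradient
# is exactly zero -- the unit can no longer learn.
import torch
import torch.nn as nn

layer = nn.Linear(1, 1)
with torch.no_grad():
    layer.weight.fill_(1.0)
    layer.bias.fill_(-10.0)       # large negative bias

x = torch.rand(64, 1)             # inputs in [0, 1)
out = torch.relu(layer(x))        # all zeros
out.sum().backward()

print(out.abs().max())            # tensor(0.)
print(layer.weight.grad)          # tensor([[0.]]) -> no learning signal
```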

What causes the Dying ReLU problem?

  • High learning rate: if the learning rate is set too high, a single large update can push the weights and bias so far into the negative range that the unit's pre-activation stays below zero for every input (see the sketch after this list).
  • Large negative bias: a large negative bias term can likewise drive the inputs to the ReLU activation negative for all data points.
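
A sketch of the first cause, with hypothetical numbers: one SGD step with an oversized learning rate pushes a healthy unit's weight and bias far into the negative range, after which it outputs zero for every input.

```python
# One oversized gradient step kills the unit: before the step the pre-activation
# 0.5*x + 0.5 is positive on [0, 1); after a step with lr=100 both weight and
# bias are strongly negative, so the ReLU output is zero everywhere.
import torch
import torch.nn as nn

unit = nn.Sequential(nn.Linear(1, 1), nn.ReLU())
with torch.no_grad():
    unit[0].weight.fill_(0.5)
    unit[0].bias.fill_(0.5)                          # unit starts out healthy

opt = torch.optim.SGD(unit.parameters(), lr=100.0)   # deliberately far too large

x = torch.rand(32, 1)
loss = (unit(x) ** 2).mean()                         # push outputs toward zero
loss.backward()
opt.step()                                           # one huge update

print(unit[0].bias)                                  # now strongly negative (roughly -150)
print(unit(x).abs().max())                           # 0. -> unit is dead for all inputs
```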

How to solve the Dying ReLU problem?

  • Use a smaller learning rate: it can be a good idea to decrease the learning rate during training.

  • Variations of ReLU: Leaky ReLU is a common and effective way to fix a dying ReLU; it adds a small slope in the negative range so that gradients keep flowing (a drop-in sketch follows this list). There are other variations such as PReLU, ELU and GELU. If you want to dig deeper, check out this link.

  • Modification of the initialization procedure: it has been demonstrated that using a randomized asymmetric initialization can help prevent the dying ReLU problem. Check out the arXiv paper for the mathematical details.
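
A drop-in sketch for the Leaky ReLU suggestion above. The 1-20-1 shape is assumed to match the question's small network, and the negative slope of 0.01 is PyTorch's default:

```python
# Same small network, but with LeakyReLU so that negative pre-activations still
# pass a small gradient instead of being cut off at zero.
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1, 20),
    nn.LeakyReLU(negative_slope=0.01),   # default slope; keeps a little gradient alive
    nn.Linear(20, 1),
)

# Other drop-in variants worth comparing: nn.PReLU(), nn.ELU(), nn.GELU()
```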

Sources:

Practical guide for ReLU

ReLU variants

Dying ReLU problem

Péter Szilvási
  • Thank you, but would you suggest increasing the number of weights in my NN layers before using Leaky ReLU? Right now my weight matrices are (1,20) and (20,1). Or would you suggest increasing the number of hidden layers? Does it have an impact? – unstableEquilibrium Sep 19 '21 at 13:03
  • It depends on your dataset. You can check whether the model is underfitting or overfitting from the train and validation error. If it is underfitting, use more layers/neurons; in case of overfitting, use fewer layers/neurons. I usually use powers of 2 for the neuron counts and decrease them in each layer, for example 64-32-16-8-1. There is no silver bullet for layer and neuron numbers; you need to experiment with it. – Péter Szilvási Sep 19 '21 at 20:36