I am using PyTorch and autograd to build my neural network. It is a small 3-layer network with a single input and a single output. I have to predict an output function from some initial conditions, and I am training with a custom loss function.
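For reference, here is a minimal sketch of the kind of setup I mean (the hidden size, optimizer, and the sine-fitting loss are just placeholders, not my actual code or loss):

```python
import torch
import torch.nn as nn

# Small 3-layer network: 1 input -> hidden -> hidden -> 1 output
class SmallNet(nn.Module):
    def __init__(self, hidden=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden),
            nn.Tanh(),              # I have also tried nn.Sigmoid() and nn.ReLU()
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

model = SmallNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.linspace(0, 1, 100).unsqueeze(1)        # dummy inputs
for step in range(1000):
    optimizer.zero_grad()
    y_pred = model(x)
    loss = ((y_pred - torch.sin(x)) ** 2).mean()  # placeholder for my custom loss
    loss.backward()
    optimizer.step()
```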
The problem I am facing is that the loss converges initially, but the gradients eventually vanish.
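This is roughly how I observe it (an illustrative check, not my exact training loop): after `loss.backward()` I print the per-parameter gradient norms, and they shrink toward zero as training goes on.

```python
# Illustrative check, placed right after loss.backward():
# print the gradient norm of each parameter to see them shrink over time
for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.norm().item())
```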
I have tried sigmoid and tanh activations; tanh gives slightly better results in terms of loss convergence.
I also tried ReLU, but since my network has so few weights, many units die (the dead-ReLU problem) and it doesn't give good results.
Is there any activation function other than sigmoid and tanh that handles vanishing gradients well for small networks like this? Any suggestions on what else I can try?