We all know that the vanishing gradient problem occurs when we train a deep neural network with sigmoid activations, and that ReLU solves it but introduces the dead neuron problem, which in turn is addressed by Leaky ReLU. Why do we move to LSTMs when RNNs suffer from the vanishing gradient problem? Why can't we just use ReLU to resolve it?
1 Answer
It's not just the vanishing gradient: RNNs also suffer from exploding gradients, because the output is constantly fed back in as the input, which leads to exponential blowup or shrinkage of the gradients.
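To see that blowup/shrinkage numerically, here is a minimal sketch (the step count and per-step factors are purely illustrative): backpropagating through an unrolled RNN multiplies one Jacobian factor per time step, so the gradient magnitude behaves roughly like `factor ** T` over `T` steps.

```python
# Toy illustration: the gradient scale after T time steps is roughly the
# product of T per-step factors, so anything persistently above or below 1
# explodes or vanishes exponentially.
T = 100  # number of time steps (illustrative value)

for factor in (0.9, 1.0, 1.1):
    print(f"per-step factor {factor}: gradient scale after {T} steps ~ {factor ** T:.3e}")

# per-step factor 0.9: gradient scale after 100 steps ~ 2.656e-05  (vanishes)
# per-step factor 1.0: gradient scale after 100 steps ~ 1.000e+00
# per-step factor 1.1: gradient scale after 100 steps ~ 1.378e+04  (explodes)
```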
You're right that Leaky ReLU could be a solution to the vanishing gradient problem. However, with ReLU and Leaky ReLU comes the problem of exploding gradients, which isn't very prevalent in feed-forward neural nets. Even a quite deep feed-forward net is still shallow compared to an unrolled RNN: RNNs are very deep in nature (one layer per time step), which is why exploding gradients show up. That is why we avoid ReLU and use the tanh activation. If you ask why not sigmoid: if you look at the gradient plots of both, the hyperbolic tangent has better (larger) gradients than the sigmoid:
$$\sigma'(x) = \sigma(x)\,(1-\sigma(x)) \le 0.25$$
$$\tanh'(x) = \operatorname{sech}^2(x) = \left(\frac{2}{e^{x}+e^{-x}}\right)^{2} \le 1.0$$
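A quick numeric check of these bounds (a NumPy sketch; the grid of x values is arbitrary):

```python
import numpy as np

x = np.linspace(-10, 10, 100_001)

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # sigma'(x) = sigma(x) * (1 - sigma(x))
d_tanh = 1.0 / np.cosh(x) ** 2          # tanh'(x) = sech^2(x)

print(f"max sigmoid gradient: {d_sigmoid.max():.4f}")  # ~0.2500, reached at x = 0
print(f"max tanh gradient:    {d_tanh.max():.4f}")     # ~1.0000, reached at x = 0
```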
Nonetheless, your intuition is correct that ReLU with an RNN could have been a contender to those fancy LSTMs and GRUs. Researchers have tried this combination, but it takes too much effort (careful weight initialization, cautious handling of learning rates) for no clear benefit over LSTMs/GRUs.
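For context, one well-known attempt in this direction (the "IRNN" of Le, Jaitly and Hinton, 2015) pairs ReLU with a recurrent weight matrix initialized to the identity. The sketch below only mimics that initialization idea; all sizes and values here are chosen for illustration, not taken from that work:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs, T = 64, 8, 50          # illustrative sizes

# "Careful initialization": the recurrent weights start as the identity, so the
# hidden state is carried forward unchanged apart from the input contribution,
# which helps keep the ReLU recurrence from exploding or dying early on.
W_hh = np.eye(hidden)
W_xh = rng.normal(0.0, 0.01, size=(hidden, inputs))
b = np.zeros(hidden)

h = np.zeros(hidden)
for t in range(T):
    x_t = rng.normal(size=inputs)
    h = np.maximum(0.0, W_hh @ h + W_xh @ x_t + b)   # ReLU recurrence

print("final hidden-state norm:", np.linalg.norm(h))
```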

- I understand. One more thing: does ReLU cause exploding gradients mainly in this architecture (RNN), or can it occur more generally? I ask because I read online that the exploding gradient problem occurs due to weight initialization (when the initial weights are very large). – Hamza Jan 18 '21 at 10:00
- The activation function alone doesn't cause the issue, and you're right that weight initialization plays a big role in it too. Alongside that, people use other methods like gradient clipping, weight regularizers, etc. to mitigate the issue. As I said at the end of the answer, the researchers who have tried ReLU with RNNs took extra care with weight initialization so that the problem doesn't occur. And yes, it generally occurs in RNN-type sequential networks, as they're quite deep in nature. – Khalid Saifullah Jan 18 '21 at 10:20
- The best option, however, is to try these out yourself; deep learning is said to be an empirical science, so there's no "perfect/correct" way of doing things. You can build one yourself and see how it performs, which will convince you more than any answer. Lastly, if this answers your question, consider accepting it so that others find it useful as well. Thanks. – Khalid Saifullah Jan 18 '21 at 10:23
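Regarding the gradient clipping mentioned in the comments above, here is a minimal sketch of clipping by global norm (the function name and the `max_norm` value are illustrative, not from the thread):

```python
import numpy as np

def clip_gradient_by_norm(grad, max_norm=1.0):
    """Rescale the gradient so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Illustrative use: an exploding gradient is rescaled before the weight update.
g = np.array([30.0, -40.0])                     # norm 50, far too large
print(clip_gradient_by_norm(g, max_norm=5.0))   # -> [ 3. -4.]  (norm 5)
```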