
I read that ResNet solves the vanishing gradient problem by using skip connections. But isn't that problem already solved by using ReLU? Is there some other important thing I'm missing about ResNet, or does the vanishing gradient problem occur even after using ReLU?

alift
    No, the ReLU does not solve the vanishing gradient problem, it is a multi-dimensional problem (it has multiple causes). Did you read the original ResNet paper? Because it covers that very nicely. – Dr. Snoopy May 29 '20 at 19:27
  • Ohh, sorry, I did not. Thanks for your response; I'll give it a read today. – krishna prasad May 30 '20 at 03:46

3 Answers


The ReLU activation solves the vanishing-gradient problem that is due to sigmoid-like non-linearities (the gradient vanishes because of the flat, saturated regions of the sigmoid).
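
As a rough illustration (my own sketch, assuming NumPy; the depth and the pre-activation value are arbitrary), you can see how a chain of sigmoid derivatives shrinks the gradient while a chain of ReLU derivatives (for positive pre-activations) does not:

```python
# Minimal sketch: compare how a backpropagated gradient shrinks through a
# deep chain of sigmoid units versus ReLU units. Weights are ignored; this
# only isolates the effect of the activation derivatives.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 20
x = 2.0              # a pre-activation value in the sigmoid's flatter region
grad_sigmoid = 1.0
grad_relu = 1.0
for _ in range(depth):
    s = sigmoid(x)
    grad_sigmoid *= s * (1.0 - s)   # sigmoid'(x) <= 0.25, so the product shrinks fast
    grad_relu *= 1.0                # ReLU'(x) = 1 for x > 0, so the product is preserved

print(grad_sigmoid)  # vanishingly small after 20 layers
print(grad_relu)     # 1.0: unaffected for positive pre-activations
```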

The other kind of "vanishing" gradient seems to be related to the depth of the network (see this, for example). Basically, when backpropagating the gradient from layer N to layer N-k, the gradient vanishes as a function of depth (in vanilla architectures). The idea of resnets is to help with gradient backpropagation (see for example Identity mappings in deep residual networks, where they present resnet v2 and argue that identity skip connections are better at this).
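
To make this concrete, here is a minimal sketch of a pre-activation residual block with an identity skip connection, assuming PyTorch (the channel count and layer choices are illustrative, not the papers' exact configuration):

```python
# Minimal sketch of a pre-activation residual block with an identity skip.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Residual branch F(x): BN -> ReLU -> conv, twice (pre-activation order).
        self.branch = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        # y = x + F(x): the identity term gives the gradient a direct path
        # around the residual branch back to earlier layers.
        return x + self.branch(x)

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 32, 32))  # output has the same shape as the input
```

Because the output is x + F(x), its derivative with respect to x contains an identity term, which is exactly why the Identity mappings paper argues for keeping the skip path a pure identity (no activation or scaling on it).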

A very interesting and relatively recent paper that sheds light on how resnets work is Residual networks behave like ensembles of relatively shallow networks. The tl;dr of this paper could be (very roughly) summarized as this: residual networks behave as an ensemble. Removing a single layer (i.e. a single residual branch, not its skip connection) doesn't really affect performance, but performance decreases in a smooth manner as a function of the number of layers that are removed, which is the way ensembles behave. Moreover, most of the gradient during training comes from short paths, and they show that training only these short paths doesn't affect performance in a statistically significant way compared to training all paths. This means that the effect of residual networks doesn't really come from depth, since the contribution of the long paths is almost non-existent.
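
To see where all these paths come from, here is a rough counting sketch (my own illustration, assuming NumPy; the block count is just an example): unrolling the skip connections of a network with n residual blocks yields 2**n paths, and the number of residual branches a given path passes through is binomially distributed, so most paths are much shorter than the full depth:

```python
# Rough sketch of the "ensemble of paths" view: each of n residual blocks is
# either traversed through its branch or skipped, giving 2**n paths whose
# lengths follow Binomial(n, 0.5).
import numpy as np
from math import comb

n_blocks = 54  # illustrative; e.g. a 110-layer ResNet has 54 residual blocks
lengths = np.arange(n_blocks + 1)
n_paths = np.array([comb(n_blocks, int(k)) for k in lengths], dtype=float)

print(f"total paths: {n_paths.sum():.3g} (= 2**{n_blocks})")
print(f"most common path length: {lengths[np.argmax(n_paths)]} of {n_blocks} blocks")
# The paper's empirical point goes further: when the gradient magnitude per
# path is measured, the contribution is dominated by even shorter paths than
# this counting argument alone suggests.
```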

Ash

The main purpose of ResNet is to make it possible to train much deeper models. In theory, deeper models (think of VGG-style architectures) should show better accuracy, but in practice they usually do not. If we add shortcut connections to the model, however, we can increase the number of layers and the accuracy as well.

Yoskutik

While the ReLU activation function does address the problem of vanishing gradients, it does not provide the deeper layers with extra information the way ResNets do. The idea of propagating the original input as deep as possible through the network, and thereby helping the network learn much more complex features, is why the ResNet architecture was introduced and why it achieves such high accuracy on a variety of tasks.