
I was reading the Deep Residual Network paper, and there is a concept in it that I cannot fully understand:

[Figure from the paper: the residual learning building block. The input x passes through two weight layers (with a ReLU in between) to produce F(x); a shortcut connection adds x back, giving the output F(x) + x.]

Question:

  1. What does it mean by "hope the 2 weight layers fit F(x)"?

  2. Here F(x) is obtained by processing x with two weight layers (plus a ReLU non-linearity), so is the desired mapping H(x) = F(x)? Where is the residual?

Johnnylin

1 Answer


What does it mean by "hope the 2 weight layers fit F(x)"?

The residual unit shown obtains F(x) by processing x with two weight layers, then adds x to F(x) to obtain H(x). Now assume that H(x) is the ideal output, the one that matches your ground truth. Since H(x) = F(x) + x, obtaining the desired H(x) depends entirely on getting the right F(x). In other words, the "hope" is that the two weight layers in the residual unit can produce that F(x); if they can, the ideal H(x) follows automatically.

Here F(x) is obtained by processing x with two weight layers (plus a ReLU non-linearity), so is the desired mapping H(x) = F(x)? Where is the residual?

The first part is correct. F(x) is obtained from x as follows.

x -> weight_1 -> ReLU -> weight_2

H(x) is obtained from F(x) as follows.

F(x) + x -> ReLU 

As for the second part of your question: the desired mapping is H(x) = F(x) + x, not H(x) = F(x). The residual is F(x), i.e. F(x) = H(x) - x.
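
If it helps, here is a minimal NumPy sketch of that computation (my own illustration with fully connected weight layers and no biases; the paper actually uses convolutional layers, and the identity shortcut assumes F(x) and x have matching dimensions so they can be added):

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    # F(x): x -> weight_1 -> ReLU -> weight_2
    F = W2 @ relu(W1 @ x)
    # H(x): F(x) + x -> ReLU  (the shortcut connection adds x back)
    return relu(F + x)

rng = np.random.default_rng(0)
d = 4
x = relu(rng.normal(size=d))        # pretend x came out of a previous ReLU
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
print(residual_block(x, W1, W2))    # H(x), same shape as x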

The authors hypothesize that the residual mapping (i.e. F(x)) may be easier to optimize than H(x). To illustrate with a simple example, assume that the ideal mapping is H(x) = x. For a direct (non-residual) mapping it would be difficult to learn the identity, because the input has to pass through a stack of non-linear layers as follows.

x -> weight_1 -> ReLU -> weight_2 -> ReLU -> ... -> x

Approximating the identity mapping with all these weights and ReLUs in the middle would be difficult.

Now, if we instead define the desired mapping as H(x) = F(x) + x, then we just need to get F(x) = 0 as follows.

x -> weight_1 -> ReLU -> weight_2 -> ReLU -> ... -> 0  # look at the last 0

Achieving the above is easy: just set either weight layer to zero and you will get a zero output. Add x back and you get your desired identity mapping.
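
Continuing the NumPy sketch from above (again just my own illustration): with the weight matrices set to zero, F(x) vanishes and the block simply passes x through. The final ReLU leaves x unchanged here because x is non-negative, as it would be coming out of a previous ReLU.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

d = 4
x = np.array([0.5, 1.2, 0.0, 3.0])   # non-negative, as if from a previous ReLU
W1 = np.zeros((d, d))                 # zero weights ...
W2 = np.zeros((d, d))                 # ... make F(x) = 0

F = W2 @ relu(W1 @ x)                 # F(x) = 0
H = relu(F + x)                       # H(x) = relu(x) = x, since x >= 0
print(np.allclose(H, x))              # True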

Another factor in the success of residual networks is the uninterrupted gradient flow from the first layer to the last. That is out of scope for your question; you can read the paper "Identity Mappings in Deep Residual Networks" for more information on this.
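
For intuition only, here is a toy scalar sketch (my own illustration, much simpler than a real network): in a plain chain each layer multiplies the gradient by a small factor, while in a residual chain each layer contributes a factor of (1 + w) because of the shortcut, so the gradient does not vanish.

# Toy scalar sketch of gradient flow (my own illustration, not from the paper).
depth, w = 20, 0.1

# Plain chain: y <- w * y at every layer, so d(out)/d(in) = w ** depth
plain_grad = w ** depth

# Residual chain: y <- y + w * y, so d(out)/d(in) = (1 + w) ** depth
residual_grad = (1.0 + w) ** depth

print(plain_grad)       # ~1e-20: the gradient has effectively vanished
print(residual_grad)    # ~6.7: the +1 from the shortcut keeps it alive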

Autonomous
  • Thanks for the answer. For the simple case H(x) = x, the residual is F(x) = H(x) - x, but I read the definition of "residual" on Wikipedia and it confuses me: when H(x) is not a simple function, why can we still represent the residual as H(x) - x? – Johnnylin Apr 08 '17 at 15:43
  • The question is not why we can still "represent" the residual as `H(x) - x`, because that holds by construction: you learn `F(x)` and add `x`, so `H(x) - x` is automatically the residual. The question should rather be why learning `F(x)` may be easier than learning `H(x)`. I don't think there is a concrete answer; the authors themselves only hypothesize this in the original paper. The success is attributed more to the fact that residual networks, by construction, allow gradients to flow uninterrupted, unlike their conventional, non-residual counterparts. – Autonomous Apr 08 '17 at 19:12