
I've recently been learning about self-attention transformers and the "Attention is All You Need" paper. When describing the architecture of the neural network used in the paper, one breakdown offered this explanation of residual connections:

"Residual layer connections are used (of course) in both encoder and decoder blocks" (origin: https://www.kaggle.com/code/residentmario/transformer-architecture-self-attention/notebook)

This was, unfortunately, not obvious to me. What is the purpose of residual connections, and why should this be standard practice?

1 Answer


There is nothing "obvious" about skip connections; they are something we learned the hard way as a community. The basic premise is that, with the standard feed-forward parametrisation of neural network layers, it is surprisingly hard to learn the identity function. Skip connections make this particular function (f(x) = x) extremely easy to learn, which improves training stability and overall performance across a wide range of applications, at essentially no extra computational cost. You are giving the network an easy way of skipping a convoluted, complex part of the computation when it does not need it, which lets us use large, complex architectures without an in-depth understanding of the dynamics of the problem (which are beyond our current understanding of the math!).
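A minimal sketch of that idea in PyTorch (the feed-forward sublayer, widths, and class name here are illustrative, not the paper's exact layer; the actual Transformer also wraps each sublayer in layer normalisation):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """A sublayer wrapped in a skip connection: output = x + sublayer(x)."""
        def __init__(self, dim, hidden_dim):
            super().__init__()
            # Illustrative position-wise feed-forward sublayer; in the
            # Transformer this slot is filled by attention or feed-forward layers.
            self.sublayer = nn.Sequential(
                nn.Linear(dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, dim),
            )

        def forward(self, x):
            # If the sublayer contributes roughly nothing, the block falls
            # back to the identity f(x) = x "for free".
            return x + self.sublayer(x)

    x = torch.randn(4, 512)            # batch of 4 vectors, model width 512
    block = ResidualBlock(512, 2048)
    y = block(x)                       # same shape as x: torch.Size([4, 512])

Without the x + term, the network would have to drive the two linear layers toward an exact identity mapping just to pass information through unchanged, which gradient descent finds surprisingly hard to do.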

You can look at older papers such as Highway Networks, which show how such connections allow training very deep models that would otherwise be too ill-conditioned to train.
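For reference, the highway layer from that line of work uses a learned gate T(x) rather than a plain addition, y = T(x) * H(x) + (1 - T(x)) * x, so the network can smoothly interpolate between transforming its input and passing it through untouched. A rough PyTorch sketch (the width and gate-bias initialisation are arbitrary choices here):

    import torch
    import torch.nn as nn

    class HighwayLayer(nn.Module):
        """y = T(x) * H(x) + (1 - T(x)) * x, with a learned gate T."""
        def __init__(self, dim):
            super().__init__()
            self.transform = nn.Linear(dim, dim)   # H(x)
            self.gate = nn.Linear(dim, dim)        # T(x)
            # Bias the gate negative so the layer starts out close to the identity.
            nn.init.constant_(self.gate.bias, -2.0)

        def forward(self, x):
            h = torch.relu(self.transform(x))
            t = torch.sigmoid(self.gate(x))
            return t * h + (1.0 - t) * x

    layer = HighwayLayer(512)
    y = layer(torch.randn(4, 512))     # torch.Size([4, 512])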

lejlot