I've recently been learning about self-attention and transformers from the "Attention Is All You Need" paper. While describing the architecture of the neural network used in the paper, one breakdown included this explanation for residual connections:
"Residual layer connections are used (of course) in both encoder and decoder blocks" (origin: https://www.kaggle.com/code/residentmario/transformer-architecture-self-attention/notebook)
This was, unfortunately, not obvious to me. What is the purpose of residual connections, and why should they be standard practice?
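
To make sure I'm asking about the right thing, here is a minimal PyTorch sketch of my current understanding of a residual connection wrapped around a transformer sub-layer (the class name, the feed-forward stand-in, and the dimensions are placeholders of my own, not code from the paper):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Wraps a sub-layer (e.g. self-attention or feed-forward) with a
    residual connection and layer normalization: LayerNorm(x + Sublayer(x))."""

    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input x is added back onto the sub-layer's output (the "skip" path),
        # so the sub-layer only needs to learn a change relative to the identity.
        return self.norm(x + self.sublayer(x))


if __name__ == "__main__":
    d_model = 512
    # A feed-forward sub-layer as a stand-in; the same wrapper would go
    # around the multi-head attention sub-layer as well.
    ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
    block = ResidualBlock(ffn, d_model)
    x = torch.randn(2, 10, d_model)   # (batch, sequence length, d_model)
    print(block(x).shape)             # torch.Size([2, 10, 512])
```

Is this the mechanism being referred to, and if so, what does the skip path actually buy us?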