Recently I was going through the Attention Is All You Need paper, and while reading it I had trouble understanding the attention network once I ignored the maths behind it. Can anyone explain the attention network with an example?
1 Answer
This tutorial illustrates each core component of the Transformer and is definitely worth reading.
Intuitively, the attention mechanism tries to find the timesteps that are most "similar" to the current one according to an attention function (e.g. projection followed by scaled dot-product in Attention Is All You Need), and then computes the new representation as a weighted sum of the previous representations, using those similarity scores as the weights.
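
To make that concrete, here is a minimal NumPy sketch of scaled dot-product attention as defined in the paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. The toy inputs, random projection matrices, and variable names are my own illustration, not code from the paper or the tutorial:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Similarity of every query timestep to every key timestep.
    scores = Q @ K.T / np.sqrt(d_k)
    # Normalize scores into attention weights that sum to 1 per query.
    weights = softmax(scores, axis=-1)
    # New representation: weighted sum of the value vectors.
    return weights @ V, weights

# Toy example: 3 timesteps with d_k = 4. The random matrices stand in
# for the learned linear projections applied to the input embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                            # input representations
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(w)    # row i shows how much timestep i attends to each timestep
print(out)  # the new representation of each timestep
```

Each row of `w` is exactly the "similarity" described above: the weights with which one timestep mixes together the value vectors of all timesteps to form its new representation.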

Crystina