
Recently I was going through the Attention Is All You Need paper, and while reading it I ran into trouble understanding the attention network if I ignore the math behind it. Can anyone explain the attention network with an example?

Kumar Mangalam

1 Answer


This tutorial illustrates each core component of the Transformer and is definitely worth reading.

Intuitively, the attention mechanism tries to find the "similar" timesteps according to an attention function (e.g., projection + scaled dot-product similarity in Attention Is All You Need), then computes the new representation for each timestep as a weighted combination of the previous representations, using the weights it just calculated.
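To make that concrete, here is a minimal NumPy sketch of the paper's scaled dot-product attention (the toy input and variable names are mine, not from the paper): every query is scored against every key, a softmax turns the scores into weights, and the output is the weighted sum of the values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) matrices of queries, keys, and values.
    d_k = Q.shape[-1]
    # Similarity of every query to every key, scaled by sqrt(d_k)
    # as in the paper, giving a (seq_len, seq_len) score matrix.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys turns each row of scores into weights
    # that sum to 1 (subtracting the row max for numerical stability).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # New representation per timestep: weighted combination of values.
    return weights @ V  # (seq_len, d_k)

# Hypothetical toy example: 3 timesteps, 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
# In the real model, learned projections W_q, W_k, W_v would produce
# Q, K, V from x; here we use x directly for simplicity.
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4)
```

A timestep whose query points in roughly the same direction as another timestep's key gets a large weight for that timestep, so its new representation is pulled toward the "similar" positions, which is exactly the intuition above.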

Crystina