Can anyone please explain clearly how masking is used in sparse attention? I don't mean masking pad tokens — I mean how masking tokens can make attention faster, as in the definition of sparse attention given in GMAT under the section Global-Memory Augmented Transformers. Doesn't softmax still operate over the same sequence length? So how is it faster?
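To make my confusion concrete, here is a minimal sketch (plain NumPy, my own toy code, not from the GMAT paper) of what I understand masked attention to be. As far as I can tell, the full n × n score matrix is still computed and softmax still runs over rows of length n, so I don't see where the speedup comes from:

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Dense attention with an additive mask: the full n x n score
    matrix is still computed; the mask only suppresses some entries."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # (n, n) -- computed for every pair
    scores = np.where(mask, scores, -1e9)   # masked positions get a large negative score
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax still over rows of length n
    return weights @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
mask = np.tril(np.ones((n, n), dtype=bool))  # some sparse/banded attention pattern
out = masked_attention(Q, K, V, mask)
print(out.shape)  # (8, 4) -- same amount of work as unmasked attention
```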