Can anyone please explain clearly how masking is used in sparse attention? I don't mean masking pad tokens — I mean how masking tokens can make attention faster, as in the definition of sparse attention given in GMAT under the section Global-Memory Augmented Transformers. Doesn't softmax still operate over the same sequence length? So how is it faster?
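To make my confusion concrete, here is a minimal sketch (plain NumPy, my own toy code, not from the GMAT paper) of what I understand masked attention to be. As far as I can tell, the full n × n score matrix is still computed and softmax still runs over rows of length n, so I don't see where the speedup comes from:

```python
import numpy as np

def masked_attention(Q, K, V, mask):
    """Dense attention with an additive mask: the full n x n score
    matrix is still computed; the mask only suppresses some entries."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # (n, n) -- computed for every pair
    scores = np.where(mask, scores, -1e9)   # masked positions get a large negative score
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax still over rows of length n
    return weights @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
mask = np.tril(np.ones((n, n), dtype=bool))  # some sparse/banded attention pattern
out = masked_attention(Q, K, V, mask)
print(out.shape)  # (8, 4) -- same amount of work as unmasked attention
```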