According to the Attention Is All You Need paper: "Additive attention [the classic attention used in RNNs by Bahdanau] computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, ..."
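For reference, I understand the additive (Bahdanau-style) scoring function to be a one-hidden-layer feed-forward network applied to each query/key pair, something like (notation is mine, not the paper's):

$$e_{ij} = v_a^\top \tanh(W_q\, q_i + W_k\, k_j)$$

where $q_i$ is the i-th query and $k_j$ the j-th key.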
Indeed, the complexity table in the paper shows that the computational complexity of additive attention and of dot-product (Transformer) attention are both n²*d.
However, if we look closer at additive attention, it is in fact an RNN cell, which has a computational complexity of n*d² (according to the same table).
Thus, shouldn't the computational complexity of additive attention be n*d² instead of n²*d?
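For concreteness, here is a rough NumPy sketch of how I am counting the operations in the two scoring functions (the names, shapes, and the choice of hidden size d are my own simplifications, not taken from the paper):

```python
import numpy as np

n, d = 8, 16                      # sequence length, model dimension
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))   # queries
K = rng.standard_normal((n, d))   # keys

# Dot-product (Transformer) scores: a single (n, d) x (d, n) matmul,
# i.e. roughly n * n * d multiply-adds -> n²*d.
dot_scores = Q @ K.T              # shape (n, n)

# Additive (Bahdanau-style) scores: a one-hidden-layer feed-forward net
# applied to every (i, j) pair. Projecting Q and K costs about 2 * n * d²
# operations; the pairwise tanh and the inner product with v add about
# n²*d on top of that.
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
v = rng.standard_normal(d)
proj_q = Q @ W_q                                            # (n, d)
proj_k = K @ W_k                                            # (n, d)
hidden = np.tanh(proj_q[:, None, :] + proj_k[None, :, :])   # (n, n, d)
add_scores = hidden @ v                                     # (n, n)
```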