Since NLP tasks have variable-length data, we need to add padding so that all inputs in a mini-batch have the same size. However, the padded positions become non-zero after passing through normalization layers. This produces gradients at the padding positions, which leads to unnecessary weight updates.
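To make the problem concrete, here is a minimal sketch (my own example, assuming a PyTorch `nn.LayerNorm` whose bias has drifted away from zero during training): an all-zero padded position comes out of the normalization equal to the bias, i.e. non-zero.

```python
import torch
import torch.nn as nn

# Minimal sketch: LayerNorm on an all-zero (padded) vector returns the bias,
# so padded positions stop being zero after the normalization layer.
torch.manual_seed(0)

d_model = 8
layer_norm = nn.LayerNorm(d_model)
# Pretend the bias has been trained to a non-zero value.
with torch.no_grad():
    layer_norm.bias.fill_(0.5)

# One "sentence" of length 4 where the last 2 positions are padding (all zeros).
x = torch.randn(1, 4, d_model)
x[:, 2:, :] = 0.0

out = layer_norm(x)
print(out[0, 2])  # padded position is now non-zero (equal to the bias)
```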
Has anyone seen a paper that tries to solve this problem?
Reference: https://tunz.kr/post/4
I found that many other implementations handle this issue by resetting the paddings to zero after every sublayer, as in the sketch below.
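For illustration, here is a minimal sketch of that workaround (the `MaskedSublayer` wrapper, the `pad_mask` shape, and the feed-forward sublayer are hypothetical names of my own, not taken from any particular implementation): the padding mask is re-applied after each sublayer so the padded positions stay exactly zero.

```python
import torch
import torch.nn as nn

# Hedged sketch of the "reset paddings to zero at every sublayer" workaround:
# multiply by the padding mask after each sublayer so padded positions carry
# no activations (and contribute no gradients through those positions).
class MaskedSublayer(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, pad_mask):
        # pad_mask: (batch, seq_len, 1) with 1.0 for real tokens, 0.0 for padding.
        out = self.norm(x + self.sublayer(x))
        return out * pad_mask  # zero out the padded positions again

# Usage with a hypothetical feed-forward sublayer.
d_model = 8
ffn = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
block = MaskedSublayer(ffn, d_model)

x = torch.randn(2, 5, d_model)
pad_mask = torch.tensor([[1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1]], dtype=torch.float32).unsqueeze(-1)
x = x * pad_mask            # start with zeroed padding
y = block(x, pad_mask)
print(y[0, 3].abs().sum())  # padded position stays exactly zero
```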