Since NLP tasks have variable-length data, we need to add padding so that all inputs in a mini-batch have the same size. However, the padded positions become non-zero after passing through normalization layers. This produces gradients at the padding positions, which leads to unnecessary weight updates.
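To make the problem concrete, here is a minimal sketch (my own example, assuming a PyTorch `nn.LayerNorm` whose bias has drifted away from zero during training): an all-zero padded position comes out of the normalization equal to the bias, i.e. non-zero.

```python
import torch
import torch.nn as nn

# Minimal sketch: LayerNorm on an all-zero (padded) vector returns the bias,
# so padded positions stop being zero after the normalization layer.
torch.manual_seed(0)

d_model = 8
layer_norm = nn.LayerNorm(d_model)
# Pretend the bias has been trained to a non-zero value.
with torch.no_grad():
    layer_norm.bias.fill_(0.5)

# One "sentence" of length 4 where the last 2 positions are padding (all zeros).
x = torch.randn(1, 4, d_model)
x[:, 2:, :] = 0.0

out = layer_norm(x)
print(out[0, 2])  # padded position is now non-zero (equal to the bias)
```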
Has anyone seen a paper that tries to solve this problem?
Reference: https://tunz.kr/post/4
I found that many other implementations handle this issue by resetting the paddings to zero after every sublayer, as in the sketch below.
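For illustration, here is a minimal sketch of that workaround (the `MaskedSublayer` wrapper, the `pad_mask` shape, and the feed-forward sublayer are hypothetical names of my own, not taken from any particular implementation): the padding mask is re-applied after each sublayer so the padded positions stay exactly zero.

```python
import torch
import torch.nn as nn

# Hedged sketch of the "reset paddings to zero at every sublayer" workaround:
# multiply by the padding mask after each sublayer so padded positions carry
# no activations (and contribute no gradients through those positions).
class MaskedSublayer(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, pad_mask):
        # pad_mask: (batch, seq_len, 1) with 1.0 for real tokens, 0.0 for padding.
        out = self.norm(x + self.sublayer(x))
        return out * pad_mask  # zero out the padded positions again

# Usage with a hypothetical feed-forward sublayer.
d_model = 8
ffn = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
block = MaskedSublayer(ffn, d_model)

x = torch.randn(2, 5, d_model)
pad_mask = torch.tensor([[1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1]], dtype=torch.float32).unsqueeze(-1)
x = x * pad_mask            # start with zeroed padding
y = block(x, pad_mask)
print(y[0, 3].abs().sum())  # padded position stays exactly zero
```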