
Since NLP tasks have variable-length inputs, we pad each sequence to the same length as the other inputs in a mini-batch. However, the padding positions become non-zero after passing through normalization layers. This produces gradients at the padding positions, which leads to unnecessary weight updates.
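A minimal sketch of the issue, assuming PyTorch and LayerNorm (the bias value below is a hypothetical stand-in for a trained, non-zero bias):

```python
import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 5, 8
x = torch.randn(batch, seq_len, d_model)

# Hypothetical mask: 1.0 for real tokens, 0.0 for padding (last two positions padded).
pad_mask = torch.ones(batch, seq_len, 1)
pad_mask[:, 3:] = 0.0
x = x * pad_mask  # padding positions are exactly zero here

layer_norm = nn.LayerNorm(d_model)
with torch.no_grad():
    layer_norm.bias.fill_(0.1)  # mimic a trained, non-zero bias

y = layer_norm(x)
print(y[:, 3:].abs().max())  # > 0: padding is no longer zero after LayerNorm
```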

Has anyone seen a paper that tries to solve this problem?

Reference: https://tunz.kr/post/4

I found that many other implementations handle this issue (e.g., by resetting the padding positions to zero at every sublayer, as sketched below).
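A minimal sketch of that workaround, assuming PyTorch; `MaskedSublayer` is a hypothetical helper, not taken from the referenced post or any specific implementation:

```python
import torch
import torch.nn as nn

class MaskedSublayer(nn.Module):
    """Wraps a sublayer and zeroes out padding positions after it runs."""
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x, pad_mask):
        # pad_mask: (batch, seq_len, 1), 1.0 for real tokens, 0.0 for padding
        return self.sublayer(x) * pad_mask

d_model = 8
block = MaskedSublayer(nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model)))

x = torch.randn(2, 5, d_model)
pad_mask = torch.ones(2, 5, 1)
pad_mask[:, 3:] = 0.0

y = block(x * pad_mask, pad_mask)
print(y[:, 3:].abs().max())  # 0.0: padding stays exactly zero after the sublayer
```

Because the padding positions are reset to zero after each sublayer, they contribute nothing to the activations of later layers, which is the behavior the question is asking about.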

S.F. Chen
