The GPT-series models use the Transformer decoder with unidirectional (causal) attention. In the Hugging Face source code for GPT, masked attention is implemented with the following registered buffer:
# Lower-triangular causal mask: position i can attend only to positions <= i
self.register_buffer(
    "bias",
    torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
        1, 1, max_positions, max_positions
    ),
)
The default attention_mask is None.
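For reference, here is how I understand the two masks to interact. This is my own simplified sketch, not the actual Hugging Face code, and the function and tensor names are made up:

import torch

# Simplified sketch (not the exact Hugging Face implementation) of how the
# causal "bias" buffer and an optional padding attention_mask combine.
def masked_attention_scores(q, k, causal_bias, attention_mask=None):
    # q, k: (batch, heads, seq_len, head_dim)
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)

    # Causal mask: position i may only attend to positions <= i.
    seq_len = q.size(-2)
    causal = causal_bias[:, :, :seq_len, :seq_len].bool()
    scores = scores.masked_fill(~causal, torch.finfo(scores.dtype).min)

    # Padding mask: attention_mask is 1 for real tokens, 0 for padding.
    # When attention_mask is None (the default), padded key positions
    # remain visible to every query.
    if attention_mask is not None:
        pad = attention_mask[:, None, None, :].bool()  # (batch, 1, 1, seq_len)
        scores = scores.masked_fill(~pad, torch.finfo(scores.dtype).min)

    return torch.softmax(scores, dim=-1)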
However, I have found that some GPT demos never pass an attention_mask derived from the valid sequence lengths. It seems that the padding tokens are not masked during attention; they are only ignored in the loss computation.
Is this correct? Or does masking the padding positions in attention simply not matter to the final results?
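To make the pattern concrete, the demos I am referring to look roughly like this (a minimal sketch; the model name, sentences, and variable names are just examples):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token       # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

batch = tokenizer(
    ["a short sentence", "a noticeably longer sentence that forces padding"],
    return_tensors="pt",
    padding=True,
)

# Padding positions are ignored in the loss via the -100 label ...
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100

# ... but attention_mask itself is never passed, so padded key positions
# are not masked out inside the attention layers.
outputs = model(input_ids=batch["input_ids"], labels=labels)
print(outputs.loss)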
Besides, I also wonder whether the embedding of the padding token changes during training.
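For example, I imagine one could check it with something along these lines (a hypothetical check that reuses the objects from the sketch above):

# Does the pad-token row of the input embedding receive any gradient?
embedding = model.get_input_embeddings()        # wte for GPT-2
pad_id = tokenizer.pad_token_id

model.zero_grad()
outputs = model(input_ids=batch["input_ids"], labels=labels)
outputs.loss.backward()

# Note: GPT-2 ties the input embedding to the output projection, so this
# gradient mixes the "input embedding" and "output softmax" contributions.
print(embedding.weight.grad[pad_id].norm())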