
The GPT-series models use the Transformer decoder, i.e. unidirectional (left-to-right) attention. In the Hugging Face GPT source code, the masked attention is implemented by registering a lower-triangular buffer:

    self.register_buffer(
        "bias",
        torch.tril(torch.ones((max_positions, max_positions), dtype=torch.uint8)).view(
            1, 1, max_positions, max_positions
        ),
    )

The `attention_mask` argument defaults to `None`.
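To make my question concrete, here is how I understand such a lower-triangular buffer being combined with an (optional) padding mask. This is my own standalone sketch, not the actual Hugging Face forward pass, and the names (`masked_attention`, `pad_mask`, etc.) are mine:

    import torch
    import torch.nn.functional as F

    def masked_attention(q, k, v, pad_mask=None):
        # q, k, v: (batch, seq, dim); pad_mask: (batch, seq), 1 = real token, 0 = padding.
        seq_len, dim = q.shape[1], q.shape[-1]
        scores = q @ k.transpose(-2, -1) / dim ** 0.5                  # (batch, seq, seq)

        # Causal mask: position i may only attend to positions <= i.
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
        scores = scores.masked_fill(~causal, float("-inf"))

        # Optional padding mask: additionally block attention *to* padded key positions.
        if pad_mask is not None:
            scores = scores.masked_fill(pad_mask[:, None, :] == 0, float("-inf"))

        return F.softmax(scores, dim=-1) @ v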

However, I have found that in some GPT demos, the `attention_mask` derived from the valid sequence lengths is never passed to the model. It seems that the padding tokens are not masked out in the attention, but only ignored in the loss computation.
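The pattern I am referring to looks roughly like this (a sketch of a typical demo rather than any specific repository; I am assuming a GPT-2 tokenizer and `GPT2LMHeadModel`):

    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    batch = tokenizer(["a short sequence", "a noticeably longer sequence in the same batch"],
                      padding=True, return_tensors="pt")

    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100        # padding ignored by the loss (ignore_index)

    # attention_mask is *not* passed here, so padding is not masked in attention;
    # it is only excluded from the loss.
    loss = model(input_ids=batch["input_ids"], labels=labels).loss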

Is this correct? Or does masking the padding in the attention simply not matter for the final results?

Besides, I also wonder whether the embedding of the padding token will change during training.
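For reference, one way I imagine checking that last point empirically is to look at the gradient of the padding token's embedding row after a backward pass, e.g. with a small randomly initialized GPT-2 (the sizes and the choice of pad id 0 are arbitrary assumptions; note that GPT-2 ties the LM head weights to the input embedding):

    import torch
    from transformers import GPT2Config, GPT2LMHeadModel

    config = GPT2Config(vocab_size=100, n_positions=16, n_embd=32, n_layer=2, n_head=2)
    model = GPT2LMHeadModel(config)
    pad_id = 0                                         # hypothetical pad token id for this toy setup

    # One right-padded sequence: real tokens first, padding at the end.
    input_ids = torch.tensor([[5, 6, 7, pad_id, pad_id]])
    labels = input_ids.clone()
    labels[input_ids == pad_id] = -100                 # ignore padding positions in the loss

    model(input_ids=input_ids, labels=labels).loss.backward()

    # If this norm is non-zero, the pad embedding row still receives updates.
    print(model.transformer.wte.weight.grad[pad_id].norm().item())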

  • Hard to tell without seeing the actual code. In general, you can ignore certain indices with [nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) and its `ignore_index` parameter. You also need to keep in mind the left-to-right attention of GPT, which means the padding tokens do not interfere with your text tokens (see the sketch after these comments). So in principle it is not wrong, but as I said, it depends on the actual implementation and task. – cronoik Apr 01 '23 at 20:59
  • Thanks for your answer! I made a mistake yesterday; now I understand. – LocustNymph Apr 02 '23 at 04:29
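
A minimal check of the point made in the comment above, assuming a small randomly initialized GPT-2 from `transformers` (the sizes and token ids are arbitrary): with right padding and left-to-right attention, the hidden states at the real-token positions should come out the same whether or not the padding positions are masked.

    import torch
    from transformers import GPT2Config, GPT2Model

    config = GPT2Config(vocab_size=100, n_positions=16, n_embd=32, n_layer=2, n_head=2)
    model = GPT2Model(config).eval()                   # eval() disables dropout

    input_ids = torch.tensor([[5, 6, 7, 0, 0]])        # 3 real tokens + 2 right-padding tokens
    attention_mask = torch.tensor([[1, 1, 1, 0, 0]])

    with torch.no_grad():
        h_unmasked = model(input_ids).last_hidden_state
        h_masked = model(input_ids, attention_mask=attention_mask).last_hidden_state

    # Real tokens come before the padding, so with left-to-right attention they
    # never attend to it; their hidden states should match in both runs.
    print(torch.allclose(h_unmasked[:, :3], h_masked[:, :3], atol=1e-5))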

0 Answers