
I am trying to use RobertaForCausalLM and/or BertGeneration for causal language modelling / next-word prediction / left-to-right prediction, but I can't figure out where the causal masking happens. I want to train with teacher forcing on the ground-truth labels, but with no information from future tokens leaking into the attention mechanism. For that, I thought the model would need a causal attention mask, but I don't see one being applied anywhere.
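
For context, this is roughly the setup I have in mind (a minimal sketch; roberta-base and the labels-as-input_ids pattern are just illustrative):

```python
import torch
from transformers import AutoTokenizer, RobertaConfig, RobertaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# is_decoder=True is supposed to switch the model into left-to-right mode
config = RobertaConfig.from_pretrained("roberta-base", is_decoder=True)
model = RobertaForCausalLM.from_pretrained("roberta-base", config=config)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
# Passing labels=input_ids gives teacher-forced next-token training;
# the model shifts the labels internally by one position.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```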

If anyone could point me to where this might be happening or why it is unnecessary, that would be helpful.

Thanks!

cronoik

1 Answer


I have found it: it happens in get_extended_attention_mask in modeling_utils.py. When the model is configured with is_decoder=True, that method combines the padding mask you pass in with a lower-triangular causal mask, so each position can only attend to itself and earlier positions. Consider this question solved 🙂
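
A quick way to see the causal pattern (a minimal sketch; the exact signature of get_extended_attention_mask has varied slightly across transformers versions):

```python
import torch
from transformers import RobertaConfig, RobertaModel

# A randomly initialised decoder-mode model is enough to inspect the mask
config = RobertaConfig(is_decoder=True)
model = RobertaModel(config)

attention_mask = torch.ones(1, 5)  # batch of 1, sequence length 5, no padding
extended = model.get_extended_attention_mask(attention_mask, (1, 5))

# Allowed positions are 0; masked (future) positions hold a large negative
# value that zeroes them out after the softmax. The 0-pattern printed here
# is lower-triangular, i.e. causal.
print((extended == 0).squeeze())
```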

Jacob Stern