I am trying to use RobertaForCausalLM and/or BertGeneration for causal language modelling (next-word / left-to-right prediction), but I can't figure out where the causal masking happens. I want to train with teacher forcing on the ground-truth labels, without any information from future tokens leaking into the attention mechanism. For that I thought the model would need a causal attention mask, but I don't see one being applied anywhere in the code...
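For context, here is a minimal version of my setup, adapted from the documented RobertaForCausalLM usage. I'm assuming `config.is_decoder = True` is the flag that's supposed to enable decoder-style behaviour, but even with it set I can't trace where a triangular mask gets built:

```python
import torch
from transformers import RobertaConfig, RobertaForCausalLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# The docs say RobertaForCausalLM should be configured as a decoder;
# I'm assuming this flag is what's meant to trigger causal masking.
config = RobertaConfig.from_pretrained("roberta-base")
config.is_decoder = True

model = RobertaForCausalLM.from_pretrained("roberta-base", config=config)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# Teacher forcing: pass the input ids as labels; the model shifts
# them internally to compute the next-token cross-entropy loss.
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
```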
If anyone could point me to where this masking happens, or explain why it isn't needed, that would be really helpful.
Thanks!