I am trying to use RobertaForCausalLM and/or BertGeneration for causal language modelling (next-word / left-to-right prediction), but I can't figure out where the causal masking happens. I want to train with teacher forcing on the ground-truth labels, without any information from future tokens leaking into the attention mechanism. For that I thought the model would need a causal attention mask, but I don't see one being applied anywhere in the code...
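For context, here is a minimal version of my setup, adapted from the documented RobertaForCausalLM usage. I'm assuming `config.is_decoder = True` is the flag that's supposed to enable decoder-style behaviour, but even with it set I can't trace where a triangular mask gets built:

```python
import torch
from transformers import RobertaConfig, RobertaForCausalLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# The docs say RobertaForCausalLM should be configured as a decoder;
# I'm assuming this flag is what's meant to trigger causal masking.
config = RobertaConfig.from_pretrained("roberta-base")
config.is_decoder = True

model = RobertaForCausalLM.from_pretrained("roberta-base", config=config)

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

# Teacher forcing: pass the input ids as labels; the model shifts
# them internally to compute the next-token cross-entropy loss.
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss
```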
If anyone could point me to where this masking happens, or explain why it isn't needed, that would be really helpful.
Thanks!