
I am planning to train a language model from scratch. I have a couple of basic questions related to this:

I want to use whole-word masking when training the LM from scratch, but I could not find how to apply this option with the Trainer.

Here is my dataset and code:

import transformers as tr
from transformers import DataCollatorForLanguageModeling

# tokenizer, model, training_args and train_data are defined elsewhere
text = ['I am huggingface fan', 'I love huggingface', ....]
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data
)

trainer.train()

But this does not apply whole-word masking.

How can I train the LM with whole-word masking using the PyTorch Trainer?

How can I train on sequences that are longer than the model's max length using the PyTorch Trainer?

MAC
  • Perhaps this collator: `DataCollatorForWholeWordMask`: https://huggingface.co/docs/transformers/v4.20.1/en/main_classes/data_collator#transformers.DataCollatorForWholeWordMask – Mehdi Jul 08 '22 at 11:21
  • Increasing the max length is often a matter of hardware limitations. – Mehdi Jul 08 '22 at 11:22

1 Answer


Using the Trainer, you need to supply a data collator that performs whole-word masking (either the built-in one or your own). See, for example, https://discuss.huggingface.co/t/how-to-use-whole-word-masking-data-collator/15778
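
A minimal sketch of that idea, using `DataCollatorForWholeWordMask` from Transformers (the collator mentioned in the comments above); the `tokenizer`, `model`, `training_args`, and `train_data` variables are assumed to be the ones from your question:

from transformers import DataCollatorForWholeWordMask

# Masks all sub-word tokens of a word together instead of masking tokens independently.
data_collator = DataCollatorForWholeWordMask(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15
)

trainer = tr.Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_data
)
trainer.train()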

For the second question: how well a transformer handles longer sequences depends on its positional encoding. Models with relative positional encodings, such as T5 and LongT5, tolerate longer sequences well. Models with sinusoidal positional encodings, like the original Transformer, can generalize somewhat beyond their training length, so you can increase the sequence length as long as your machine does not run out of memory (OOM). The best option for long inputs is a model with sparse attention, such as LongT5 or Longformer.
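
If you want to keep your current model instead, one common workaround (not part of the original answer) is to split long documents into blocks no longer than the model's max length before training, as in the Hugging Face run_mlm example. A hedged sketch, assuming `train_data` is a tokenized `datasets.Dataset` and `block_size` is set to your model's max length:

block_size = 512  # assumed: set this to your model's max length

def group_texts(examples):
    # Concatenate all tokenized fields, then split them into fixed-size blocks.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

train_data = train_data.map(group_texts, batched=True)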

Arij Aladel