I’m trying to train a BERT model from scratch on my own dataset using the HuggingFace library. I would like to train the model so that it has the exact architecture of the original BERT model.
In the paper, it is stated that: “BERT is trained on two tasks: predicting randomly masked tokens (MLM) and predicting whether two sentences follow each other (NSP). SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.”
I’m trying to understand how to train the model on the two tasks above. At the moment, I have initialised the model as below:
from transformers import BertConfig, BertForMaskedLM

config = BertConfig()  # defaults match the original BERT-base architecture
model = BertForMaskedLM(config=config)
However, this only covers MLM and not NSP. How can I initialize and train the model with NSP as well, or was my original approach fine as it is?
My assumption would be either:

1. Initialize with BertForPreTraining (for both MLM and NSP; see the sketch after this list), OR
2. After finishing training with BertForMaskedLM, initialise the same model and train again with BertForNextSentencePrediction (but this approach’s computation and resources would cost twice as much…)
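For option 1, here is a minimal sketch of what I have in mind, assuming a from-scratch BertConfig and borrowing the bert-base-uncased vocabulary purely for illustration (the sentences and labels below are toy placeholders, not real masked pretraining data):

import torch
from transformers import BertConfig, BertTokenizerFast, BertForPreTraining

# BertForPreTraining carries both heads: MLM and NSP
config = BertConfig()                              # defaults match BERT-base
model = BertForPreTraining(config)

# reusing the bert-base-uncased vocabulary purely for illustration
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# one (sentence A, sentence B) pair; token_type_ids mark the two segments
encoding = tokenizer("The cat sat on the mat.", "It fell asleep there.", return_tensors="pt")

# toy labels: real pretraining masks ~15% of tokens and sets unmasked positions to -100
mlm_labels = encoding["input_ids"].clone()
nsp_label = torch.tensor([0])                      # 0 = B follows A, 1 = B is random

outputs = model(**encoding, labels=mlm_labels, next_sentence_label=nsp_label)
print(outputs.loss)                                # combined MLM + NSP loss

If I understand correctly, TextDatasetForNextSentencePrediction combined with DataCollatorForLanguageModeling is supposed to produce batches in this format from a raw text file, but I’m not sure whether that is still the recommended route.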
I’m not sure which one is the correct way. Any insights or advice would be greatly appreciated.