
I’m trying to train a BERT model from scratch on my own dataset using the HuggingFace library. I would like the trained model to have the exact architecture of the original BERT model.

In the original paper, it is stated that: “BERT is trained on two tasks: predicting randomly masked tokens (MLM) and predicting whether two sentences follow each other (NSP). SCIBERT follows the same architecture as BERT but is instead pretrained on scientific text.”

I’m trying to understand how to train the model on these two tasks. At the moment, I initialised the model as below:

from transformers import BertConfig, BertForMaskedLM
config = BertConfig()  # default config, which matches the BERT-base architecture
model = BertForMaskedLM(config=config)

However, this only covers MLM, not NSP. How can I initialize and train the model with NSP as well, or was my original approach fine as it is?

My assumptions would be either

  1. Initialize with BertForPreTraining (for both MLM and NSP), OR

  2. After finishing training with BertForMaskedLM, initialize the same model and train it again with BertForNextSentencePrediction (but this approach would cost roughly twice the computation and resources…)

I’m not sure which one is the correct way. Any insights or advice would be greatly appreciated.

tlqn

2 Answers


You can easily train BERT from scratch on both the MLM and NSP tasks using a combination of BertForPreTraining, TextDatasetForNextSentencePrediction, DataCollatorForLanguageModeling, and Trainer.

I wouldn't suggest training your model on MLM first and then on NSP, as that might lead to catastrophic forgetting: the model essentially forgets what it learned during the previous training.

  1. Load your pre-trained tokenizer.
from transformers import BertTokenizer
bert_cased_tokenizer = BertTokenizer.from_pretrained("/path/to/pre-trained/tokenizer/for/new/domain", do_lower_case=False)
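If you do not yet have a tokenizer for the new domain, one possible way to obtain one (an assumption, not part of the original recipe) is to train a cased WordPiece tokenizer with the tokenizers library; the corpus path and vocabulary size below are placeholders:

from tokenizers import BertWordPieceTokenizer

# Train a cased WordPiece vocabulary on raw domain text
domain_tokenizer = BertWordPieceTokenizer(lowercase=False)
domain_tokenizer.train(files=["/path/to/raw/domain/corpus.txt"], vocab_size=30_522)
# save_model writes vocab.txt, which BertTokenizer.from_pretrained can load as above
domain_tokenizer.save_model("/path/to/pre-trained/tokenizer/for/new/domain")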
  2. Initialize your model with BertForPreTraining.
from transformers import BertConfig, BertForPreTraining
config = BertConfig()
model = BertForPreTraining(config)
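If you want to be explicit about matching the original BERT-base architecture, note that the BertConfig defaults already correspond to it. The sketch below (an illustration, not part of the original recipe) spells out the main hyperparameters and ties vocab_size to your own tokenizer:

from transformers import BertConfig, BertForPreTraining

# These values are the BertConfig defaults and match BERT-base;
# vocab_size should match the tokenizer trained for your domain
config = BertConfig(
    vocab_size=bert_cased_tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
model = BertForPreTraining(config)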
  3. Create the dataset for the NSP task. TextDatasetForNextSentencePrediction will tokenize the text and create the labels for the sentence pairs. Your dataset should be in the following format (or you could just modify the existing code):

(1) One sentence per line; these should ideally be actual sentences. (2) Blank lines between documents.

Sentence-1 From Document-1
Sentence-2 From Document-1
Sentence-3 From Document-1
...

Sentence-1 From Document-2
Sentence-2 From Document-2
Sentence-3 From Document-2
from transformers import TextDatasetForNextSentencePrediction
dataset = TextDatasetForNextSentencePrediction(
    tokenizer=bert_cased_tokenizer,
    file_path="/path/to/your/dataset",
    block_size = 256
)
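To make the expected file layout concrete, here is a hypothetical snippet (the file path and sentences are made up) that writes documents in the one-sentence-per-line, blank-line-between-documents format described above:

# Write the raw corpus in the format TextDatasetForNextSentencePrediction expects
documents = [
    ["First sentence of document one.", "Second sentence of document one."],
    ["First sentence of document two.", "Second sentence of document two."],
]
with open("/path/to/your/dataset", "w", encoding="utf-8") as f:
    for doc in documents:
        f.write("\n".join(doc))
        f.write("\n\n")  # a blank line separates documents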
  4. Use DataCollatorForLanguageModeling for masking and for passing through the labels generated by TextDatasetForNextSentencePrediction. DataCollatorForNextSentencePrediction has been removed, since it was doing the same thing as DataCollatorForLanguageModeling.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_cased_tokenizer, 
    mlm=True,
    mlm_probability= 0.15
)
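As a quick sanity check of the masking behaviour (a sketch, not part of the original recipe; the sentence is made up), you can push a single tokenized example through the collator and inspect the result:

# About 15% of the tokens are selected for masking (mostly replaced by [MASK]);
# "labels" is -100 everywhere except at those selected positions
example = bert_cased_tokenizer("A single example sentence.", return_special_tokens_mask=True)
batch = data_collator([example])
print(bert_cased_tokenizer.decode(batch["input_ids"][0]))
print(batch["labels"][0])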
  5. Train & save

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="/path/to/output/dir/for/training/arguments",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=16,  # "per_gpu_train_batch_size" is deprecated in newer versions
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("path/to/your/model")
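After training, the saved checkpoint can be reloaded with whichever head you need for downstream use. A small sketch (the path is the placeholder used in the call above):

from transformers import BertForMaskedLM, BertForNextSentencePrediction

# The checkpoint above was trained with both pretraining heads (BertForPreTraining),
# so either task-specific class can load it directly
mlm_model = BertForMaskedLM.from_pretrained("path/to/your/model")
nsp_model = BertForNextSentencePrediction.from_pretrained("path/to/your/model")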
Khan9797
  • Hello @Khan9797 I'm trying out this code on Colab but I got an error: `RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling "cublasCreate(handle)"`, any idea on how I could fix that? – SilentCloud Oct 12 '21 at 07:47
  • 1
    This is significantly more useful than the "official" HF site/blog/script. Thanks! – GrimSqueaker May 04 '22 at 11:58
  • You are welcome @GrimSqueaker. I had trouble with Huggingface docs too back then :( I hope it'll be better in the future – Khan9797 May 04 '22 at 23:02
  • @Khan9797, I am trying to use BERT for Next sentence prediction. Can your code be used for it? If yes, please also let me know how to use it to test as well. Thanks – not_yet_a_fds May 06 '22 at 16:01
  • @not_yet_a_fds the above can be used for training BERT from scratch with MLM + NSP. For Next Sentence Prediction only, you should use `BertForNextSentencePrediction` instead of `BertForPreTraining`. Also, since you want to fine-tune BERT on Next Sentence Prediction, you should use a pre-trained BERT model – Khan9797 May 09 '22 at 06:06
  • @Khan9797 I have done something very similar to what you have described. The problem is that the accuracy of the MLM task is 0.3% and the accuracy of the NSP task is 65%. I did not externally specify a loss function; instead I am using the internal loss function: "No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want to compute a loss function". Any possible reason for this behaviour? – Bella_18 May 31 '23 at 13:39
  • @Bella_18 I used PyTorch with Huggingface and probably a different version of the Huggingface Transformers library, so I can't be sure why you received this error. But it looks like it's just a warning for when you forget to pass a loss function, because it says "Don't panic - this is a common way to train TensorFlow models in Transformers!". There are lots of reasons why MLM accuracy can be so low. Did you train your own tokenizer? You can check the tokenized version of the input: if the tokenizer emits [UNK], it means that the token is not defined in the vocabulary, and the model can't predict it. – Khan9797 Jun 01 '23 at 06:51

I would suggest doing the following:

  1. First pre-train BERT on the MLM objective. HuggingFace provides a script especially for training BERT on the MLM objective on your own data: the run_mlm.py script in the language-modeling examples of the Transformers repository. As you can see in the run_mlm.py script, they use AutoModelForMaskedLM, and you can specify any architecture you want.

  2. Second, if you want to train on the next sentence prediction task, you can define a BertForPreTraining model (which has both the MLM and NSP heads on top), load in the weights from the model you trained in step 1, and then further pre-train it on the next sentence prediction task, as sketched below.
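A minimal sketch of step 2, assuming the MLM checkpoint from step 1 was saved to a hypothetical local path:

from transformers import BertForPreTraining

# Loads the encoder and MLM head weights from the MLM-only checkpoint;
# the NSP head is newly initialised and is trained during this second stage
model = BertForPreTraining.from_pretrained("/path/to/mlm-checkpoint")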

UPDATE: apparently the next sentence prediction task did help improve the performance of BERT on some GLUE tasks. See this talk by the author of BERT.

Niels
  • I have a quick follow-up question on this. For your listed number 2, I would need labeled data to train on the NSP task, is that correct? For example, which sentence is A and which is B, and which one follows the other? – tlqn Feb 09 '21 at 18:39
  • Yes, although labeling in that case is trivial. You can simply crawl a lot of pages from the web, and create both pairs of sentences that really followed each other in a document (label these as 1) and pairs of random sentences (labeled as 0). See [here](https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/create_pretraining_data.py#L223) how the author of BERT did this (from the official BERT repo). – Niels Feb 10 '21 at 14:08
  • The link you provided in the bullet point #1 is not working – SilentCloud Sep 28 '21 at 12:07
  • The language-modelling notebook available at https://github.com/huggingface/transformers/tree/master/examples/pytorch uses a model other than Bert but gives the solution for labelling the dataset. – Mahima Dec 27 '21 at 06:45
  • @Niels I have implemented your step 2. The problem is that the accuracy of the MLM task is 0.3% and the accuracy of the NSP task is 65%. I did not externally specify a loss function; instead I am using the internal loss function: "No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass loss=None if you do not want to compute a loss function". Any possible reason for this behaviour? – Bella_18 May 31 '23 at 13:47