
My question here is not how to add new tokens or how to train on a domain-specific corpus; I'm already doing that.

The thing is: am I supposed to add the domain-specific tokens (to the tokenizer) before the MLM training, or should I just let BERT figure them out from context? If I choose not to include the tokens, will I end up with a poor model for downstream tasks like NER?

To give you more background: I'm training a BERT model on Portuguese medical text, so disease names, drug names, and similar terms are present in my corpus, but I'm not sure whether I have to add those tokens before the training.
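For illustration, here is a rough sketch of checking how badly a pretrained Portuguese checkpoint fragments such terms (the checkpoint name and the term list below are only placeholders, not my real data):

from transformers import AutoTokenizer

# BERTimbau checkpoint used here only as an example of a Portuguese BERT.
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")

domain_terms = ["dipirona", "losartana", "hipotireoidismo"]  # illustrative sample
for term in domain_terms:
    pieces = tokenizer.tokenize(term)
    # Terms that fragment into many subword pieces are the main candidates
    # for being added to the vocabulary before MLM training.
    print(term, "->", pieces, f"({len(pieces)} pieces)")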

I saw this one: Using Pretrained BERT model to add additional words that are not recognized by the model

But the doubts remain, as other sources say otherwise.

Thanks in advance.

  • https://github.com/huggingface/tokenizers/issues/507#issuecomment-722705812 – Ashwin Geet D'Sa Apr 13 '21 at 22:22
  • Are you training a BERT model from scratch or are you finetuning a Portuguese language model for your medical dataset? Have you checked how many of these words will split? – cronoik Apr 14 '21 at 09:52
  • Hello @cronoik, I'm fine-tuning a BERT model to a specific domain. The new token set is about 1,000 tokens; I found that number by computing TF-IDF and taking the most significant words. That was done for the MLM training. – rdemorais Apr 14 '21 at 16:35
  • @AshwinGeetD'Sa thanks for the comment. But it is not what I'm looking for. I know how to add new tokens, but I'm not sure if I have to. – rdemorais Apr 14 '21 at 16:36
  • Training 1000 new token representations while finetuning a downstream task feels (!) like too much to me for decent results. Does your downstream task allow working with placeholders? E.g. replace all medicine names with a new token `[MEDICINE]`, all chemical equations with a new token `[CHEMICALEQUATION]`, and so on -> BERT -> replace the placeholders with the original strings again? (A rough sketch of this placeholder idea follows these comments.) If not, I would try to finetune for 2 epochs without adding additional tokens on a small subset and see if the result is already satisfying. – cronoik Apr 14 '21 at 19:31
  • Actually the fine-tuning task is MLM; after that I'm going to use the model for NER and other tasks. First of all I thought it would be useful to let the base model learn how the new text presents itself, and I've got good results from it. But eventually I realized that I was training the MLM without adding new tokens. That's when I asked myself whether adding them would pay off in the end. I'm afraid the only way to know for sure is trying. It is not possible to mask names, the corpus is huge. But I liked the idea. – rdemorais Apr 15 '21 at 00:17
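For reference, a rough sketch of the placeholder idea from cronoik's comment above (the checkpoint, the `[MEDICINE]` token and the drug list are illustrative assumptions, not code from this thread):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
# Register the placeholder as a special token so it is never split.
tokenizer.add_special_tokens({"additional_special_tokens": ["[MEDICINE]"]})

text = "Paciente em uso de dipirona 500mg a cada 6 horas."
medicines = ["dipirona"]  # hypothetical gazetteer of drug names

masked = text
for drug in medicines:
    masked = masked.replace(drug, "[MEDICINE]")

print(tokenizer.tokenize(masked))  # '[MEDICINE]' survives as a single token
# After the model produces its output, the placeholders can be mapped back
# to the original strings.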

1 Answer


Yes, you have to add them to the model's vocabulary:

from transformers import BertTokenizer, BertModel  # use BertForMaskedLM for MLM training

tokenizer = BertTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(['new', 'rdemorais', 'blabla'])  # your domain-specific tokens

model = BertModel.from_pretrained(model_name, return_dict=False)
# Grow the embedding matrix to match the enlarged tokenizer vocabulary.
model.resize_token_embeddings(len(tokenizer))

The last line is important: since you changed the number of tokens in the tokenizer's vocabulary, you also need to resize the model's token embedding matrix accordingly.
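As a quick sanity check (a sketch, not part of the original snippet), you can confirm that an added token is now kept whole and that the embedding matrix grew; note that the new rows are randomly initialized and only become meaningful after further MLM training on your corpus:

print(tokenizer.tokenize("rdemorais"))            # ['rdemorais'] instead of subword pieces
print(model.get_input_embeddings().weight.shape)  # (original vocab size + 3, hidden size)
# Remember to save the tokenizer and the resized model together.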

Berkay Berabi
  • Hello @Berkay Berabi, thank you for replying. – rdemorais Apr 17 '21 at 15:51
  • It is useful to point out that the tokens added to the tokenizer must be in *lower case*, because otherwise they will not be recognized (regardless of the case with which they will appear in a text to be tokenized). – Piercarlo Slavazza Jan 19 '22 at 08:52
  • So I add new tokens first and only then train the model with MLM? – ruslaniv Oct 30 '22 at 14:00
  • yes, or you can try to find another pretrained model that already contains the tokens you want to add – Berkay Berabi Oct 31 '22 at 08:45
  • But shouldn't you actually train the tokenizer from scratch on your own corpus rather than just adding new tokens to the existing tokenizer? – ruslaniv Dec 03 '22 at 04:44