
I want some help regarding adding additional words to the existing BERT model. I have two queries; kindly guide me.

I am working on an NER task for a specific domain:

There are a few words (I am not sure of the exact number) that BERT tokenizes as [UNK], but the model needs to recognize those entities. The pretrained "bert-base-cased" model reaches up to 80% accuracy when fine-tuned on my labeled data, but intuitively the model should learn better if it recognized all the entities.

  1. Do I need to add those unknown entities to vocab.txt and train the model again?

  2. Do I need to train the BERT model on my data from scratch?

Thanks...

muzamil
  • Excuse me, did you find the solution for this problem? – user5520049 Jan 16 '22 at 15:57
  • @user5520049 Yes, I solved that problem by pre-training the BERT model on my domain dataset and then training the domain-adapted BERT model for the downstream NER task. Here is the GitHub link: https://github.com/geo47/MenuNER – muzamil Jan 19 '22 at 05:49

1 Answer


BERT works well because it is pre-trained on a very large textual dataset of 3.3 billion words. Training BERT from scratch is resource-demanding and does not pay off in most reasonable use cases.

BERT uses the WordPiece algorithm for input segmentation. This should in theory ensure that there are no out-of-vocabulary tokens that would end up as [UNK]. The worst-case scenario in the segmentation would be that input tokens end up segmented into individual characters. If the segmentation is done correctly, [UNK] should appear only if the tokenizer encounters UTF-8 characters that were not in the training data.
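
A quick way to see this behaviour is to run the tokenizer directly. This is a minimal sketch with the Hugging Face transformers library (the example words are made up for illustration):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    # An out-of-vocabulary word is broken into word pieces rather than [UNK] ...
    print(tokenizer.tokenize("gastroesophageal"))
    # e.g. ['gas', '##tro', '##es', '##opha', '##geal'] (exact pieces may differ)

    # ... and in the worst case into individual characters, still without [UNK].
    print(tokenizer.tokenize("qzxv"))
    # e.g. ['q', '##z', '##x', '##v']

    # [UNK] should only appear for characters missing from the vocabulary entirely.
    print(tokenizer.tokenize("☃"))
    # likely ['[UNK]']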

The most probable sources of your problem are:

  1. There is a bug in the tokenization, so it produces tokens that are not in the WordPiece vocabulary (perhaps word tokenization instead of WordPiece tokenization?). A quick check for this is sketched after this list.

  2. There is an encoding issue that generates invalid or unusual UTF-8 characters.
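
To check for the first case, compare the tokenizer's own WordPiece output with a direct vocabulary lookup of whitespace-split words. A minimal sketch with the Hugging Face transformers library (the sentence is a made-up example):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    sentence = "Patient was given ondansetron"  # hypothetical domain sentence

    # Correct: WordPiece segments unknown words into subword pieces.
    print(tokenizer.tokenize(sentence))
    # e.g. ['Patient', 'was', 'given', 'on', '##dan', '##set', '##ron']

    # Buggy pattern: whitespace-split words looked up directly in vocab.txt;
    # any whole word missing from the vocabulary maps straight to [UNK].
    ids = tokenizer.convert_tokens_to_ids(sentence.split())
    print(tokenizer.convert_ids_to_tokens(ids))
    # e.g. ['Patient', 'was', 'given', '[UNK]']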

Jindřich
  • Thanks for the answer! Quoting this: "The worst-case scenario in the segmentation would be that input tokens end up segmented into individual characters." Wouldn't that affect the end result in predicting entities? Is it possible to use pretrained BERT and further train the model on domain-specific raw text data? And what do you think about adding new words to the vocab.txt file? Thanks – muzamil Nov 13 '20 at 12:38
  • Oversegmented entities should not be a problem for the model. Continued pre-training on domain-specific data can help. Adding words to vocab.txt is not an option; it would mean adding embeddings for the new symbols and re-training the model with the new parameters. – Jindřich Nov 13 '20 at 12:43
  • Thanks again :-) Can you please explain a bit about the idea that "continued pre-training on domain-specific data can help"? By this, do you mean the annotated data? – muzamil Nov 13 '20 at 12:47
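
For context on the last comments: "continued pre-training" here means running BERT's masked-language-model objective on raw, unannotated domain text and then fine-tuning the resulting checkpoint for NER (the approach the asker later used, per the MenuNER link above). A minimal sketch with the Hugging Face transformers library; the corpus path and hyperparameters are illustrative assumptions, not from the thread:

    from transformers import (BertForMaskedLM, BertTokenizer,
                              DataCollatorForLanguageModeling,
                              LineByLineTextDataset, Trainer, TrainingArguments)

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    model = BertForMaskedLM.from_pretrained("bert-base-cased")

    # Raw (unannotated) domain text, one sentence per line; path is hypothetical.
    dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                    file_path="domain_corpus.txt",
                                    block_size=128)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                               mlm=True, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-domain-adapted",
                               num_train_epochs=3,
                               per_device_train_batch_size=16),
        data_collator=collator,
        train_dataset=dataset,
    )
    trainer.train()
    trainer.save_model("bert-domain-adapted")  # then fine-tune this model for NER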