
We have a large amount of domain-specific data (200M+ data points, each document ~100 to ~500 words long), and we wanted to build a domain-specific LM.

We took a sample of these documents (2M+) and fine-tuned RoBERTa-base on the Masked Language Modelling (MLM) task using HF Transformers; a rough sketch of this setup is shown after the list below.

So far,

  1. we trained for 4-5 epochs (sequence length 512, batch size 48)
  2. we used a cosine learning-rate scheduler (2-3 cycles over the epochs)
  3. we used dynamic masking (15% of tokens masked)
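A minimal sketch of this continued MLM pre-training setup. The corpus file name `domain_corpus.txt` (one document per line), the learning rate, and the logging/saving intervals are illustrative, not our exact values:

```python
# Minimal sketch of the continued MLM pre-training described above.
# Assumptions (not the exact original code): a plain-text file
# "domain_corpus.txt" with one document per line, lr 5e-5, logging/saving steps.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# One document per line; tokenize to at most 512 tokens.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking: 15% of tokens are re-masked on every pass over the data.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="roberta-domain-mlm",
    num_train_epochs=5,
    per_device_train_batch_size=48,
    lr_scheduler_type="cosine",
    learning_rate=5e-5,
    logging_steps=500,
    save_steps=5000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```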

Since this RoBERTa model is fine-tuned on domain-specific data, we expect it to perform better than the pre-trained RoBERTa, which is trained on general text (Wikipedia, books, etc.).

We performed some tasks, such as Named Entity Recognition (NER), text classification, and embedding generation for cosine-similarity comparisons, with both the fine-tuned domain-specific RoBERTa and the pre-trained RoBERTa.
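For the embedding/cosine-similarity part, we do roughly the following (a minimal sketch; the fine-tuned model path `roberta-domain-mlm` and the mean-pooling choice are illustrative):

```python
# Minimal sketch of the embedding + cosine-similarity comparison.
# Assumptions: mean pooling over the last hidden state, and a local path
# "roberta-domain-mlm" for the fine-tuned model (both illustrative).
import torch
from transformers import AutoModel, AutoTokenizer

def embed(texts, model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state           # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)            # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling -> (batch, dim)

texts = ["first domain document ...", "second domain document ..."]
for name in ["roberta-base", "roberta-domain-mlm"]:       # pre-trained vs. fine-tuned
    a, b = embed(texts, name)
    print(name, torch.nn.functional.cosine_similarity(a, b, dim=0).item())
```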

Surprisingly, the results are practically the same (very small differences) for both models. We tried spaCy models too, with the same outcome.

Perplexity scores indicate that the fine-tuned MLM-based RoBERTa reaches a very low loss.
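(For context, the perplexity number is simply the exponentiated MLM evaluation loss; a minimal sketch, assuming `trainer` is the Trainer from the sketch above and a held-out split `tokenized_validation` was tokenized the same way as the training data:)

```python
# Perplexity is just exp(eval loss) on a held-out MLM split.
# Assumes `trainer` from the sketch above and a validation set
# `tokenized_validation` tokenized like the training data.
import math

eval_metrics = trainer.evaluate(eval_dataset=tokenized_validation)
print(f"MLM perplexity: {math.exp(eval_metrics['eval_loss']):.2f}")
```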

  1. Can anyone please help us understand why the MLM-fine-tuned model is NOT performing better?
  2. Should we use more data, more epochs, or both, to see an effect?
  3. Are we doing anything wrong here? Let me know if any required details are missing; I will update the question.

Any suggestions or valuable links addressing these concerns would be really helpful.

Huggingface discussion page: https://discuss.huggingface.co/t/fine-tuned-mlm-based-roberta-not-improving-performance/36913

Kalsi
    Try plotting the training curve, e.g. discuss.huggingface.co/t/plot-loss-curve-with-trainer/9767/10 – alvas Apr 22 '23 at 11:54

1 Answer


A clarification on vocabulary: fine-tuning refers to training on task-specific, annotated data with objectives other than MLM. (Continued) pre-training refers to any period of MLM training.

The problem largely depends on two factors, which you seem to ignore in your setting so far:

  1. Depending on the domain differences, the model will still use the same vocabulary representation for the words. In general, if you have highly domain-specific words, these will be split into many smaller subword units, which makes reasoning over sequences of subwords harder. To address this issue, you unfortunately don't have much choice but to train a model completely from scratch and initialize a vocabulary on your own data. For relevant literature, I recommend the SciBERT paper by Beltagy et al.; they basically demonstrate that having a domain-specific vocabulary can greatly benefit downstream performance. (A quick way to check how badly your domain terms fragment is sketched after this list.)
  2. Having done additional pre-training (fine-tuning with MLM objective) is not a key requirement for achieving better fine-tuning performance. In fact, the whole point of pre-training is to achieve high domain-specific fine-tuning performance without the need for large (unannotated) data, see the original BERT paper, for example.
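To get a feel for how badly your vocabulary fragments, you can inspect the tokenizer output directly. A minimal sketch, where the example terms are placeholders for your own domain-specific words:

```python
# Check how much the RoBERTa tokenizer fragments your domain vocabulary.
# The example terms are placeholders; use your own domain-specific words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
for term in ["pharmacokinetics", "immunohistochemistry"]:
    pieces = tokenizer.tokenize(" " + term)  # leading space: treat as a full word
    print(f"{term} -> {pieces} ({len(pieces)} subwords)")
```

The more subwords per term, the more likely a domain-specific vocabulary (and hence training from scratch) would pay off.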

Without knowing much about your fine-tuning datasets (i.e., the actual annotated sets for NER, classification, or similarity search), it is quite difficult to tell how you can improve your results. In my experience, it is usually worth it to simply annotate more task-specific data instead of wasting compute hours on additional pre-training.
I am aware that this is usually the costlier option (compute is quite cheap to get, human annotations not so much), but I have yet to see a project where more (high-quality) data didn't help.

To answer some of your concerns regarding the training performance: as previous commenters have pointed out, it should help to investigate the loss curve (ideally using a constant/linear learning rate for better tractability); if you see a continuously decreasing loss on your validation set, then you may consider training for more epochs. IMHO, training for <5 epochs on sufficiently large datasets (more than 1,000, better ~10,000 instances) usually already achieves decent performance. I know there is little "hard evidence" to back up these claims; I will admit that they are purely empirical.
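For inspecting the loss curve, something along these lines should do. A minimal sketch, assuming `trainer` is your Hugging Face Trainer after training, with losses logged via `logging_steps` (and an eval set, if you want the validation curve):

```python
# Plot the training (and, if available, validation) MLM loss from the
# Trainer's log history.
import matplotlib.pyplot as plt

history = trainer.state.log_history
train = [(h["step"], h["loss"]) for h in history if "loss" in h]
evals = [(h["step"], h["eval_loss"]) for h in history if "eval_loss" in h]

plt.plot(*zip(*train), label="train loss")
if evals:
    plt.plot(*zip(*evals), label="validation loss")
plt.xlabel("step")
plt.ylabel("MLM loss")
plt.legend()
plt.show()
```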

dennlinger