
I have a large unlabeled corpus, on which I trained my BERT tokenizer.

Then I want to build a BertModel to do binary classification on a labeled dataset. However, this dataset is highly imbalanced (1:99). So my questions are:

  1. Would BertModel perform well on an imbalanced dataset?
  2. Would BertModel perform well on a small dataset? (as small as fewer than 500 data points; I bet it would not..)

1 Answer


The objective of transfer learning with pre-trained models partially answers your questions. A BertModel pre-trained on a large corpus and then adapted to a task-specific corpus usually performs better than non-pre-trained models (for example, a simple LSTM trained from scratch for the classification task).
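For instance, with the Hugging Face transformers library (an assumption here; the question does not name a library), loading the pre-trained weights and attaching a fresh binary classification head looks like this:

```python
# A minimal sketch, assuming the Hugging Face transformers library and
# the standard bert-base-uncased checkpoint (both are assumptions here).
from transformers import BertTokenizerFast, BertForSequenceClassification

# The encoder weights carry general language knowledge from pre-training;
# only the 2-way classification head on top is randomly initialized.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # binary classification
)
```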

BERT has been shown to perform well when fine-tuned on a small task-specific corpus (this answers your question 2). However, the level of improvement also depends on the domain and the task you want to perform, and on how closely the data used for pre-training is related to your target dataset.
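As a rough sketch of what fine-tuning on a small labeled set can look like (again assuming the transformers library; the texts, labels, and hyperparameters below are placeholders, not values from the question):

```python
# A fine-tuning sketch for a small labeled dataset; all data and
# hyperparameters are illustrative placeholders.
import torch
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["example sentence one", "example sentence two"]  # your ~500 samples
labels = [0, 1]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True)

class SmallDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,            # a few epochs usually suffice on small data
    per_device_train_batch_size=8,
    learning_rate=2e-5,            # small LR to avoid wrecking pre-trained weights
)
Trainer(model=model, args=args,
        train_dataset=SmallDataset(encodings, labels)).train()
```

With only a few hundred examples, a small learning rate and a handful of epochs help avoid catastrophically overwriting the pre-trained weights.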

In my experience, pre-trained BERT fine-tuned on the target task performs much better than other DNNs such as LSTMs and CNNs when the dataset is highly imbalanced. However, this again depends on the task and the data. 1:99 is a really huge imbalance, which might require data-balancing techniques, such as the weighted loss sketched below.
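One common balancing technique is to weight the loss so that errors on the rare class count more. A minimal PyTorch sketch (the 1:99 weights are illustrative, not tuned values):

```python
# A sketch of class-weighted cross-entropy; the weights below mirror a
# 1:99 label split for illustration only.
import torch
import torch.nn as nn

# Up-weight the rare positive class so minority errors dominate the loss.
class_weights = torch.tensor([1.0, 99.0])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(4, 2)            # model outputs for a batch of 4
targets = torch.tensor([0, 0, 0, 1])  # mostly the majority class
loss = loss_fn(logits, targets)
```

Oversampling the minority class (e.g., with torch.utils.data.WeightedRandomSampler) is an alternative that works at the data level instead of the loss level.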

Ashwin Geet D'Sa