The idea behind transfer learning with pre-trained models partially answers your questions. A BertModel
pre-trained on a large corpus, when adapted to a task-specific corpus, usually performs better than non-pre-trained models (for example, a simple LSTM trained from scratch on the classification task).
BERT has been shown to perform well when fine-tuned on a small task-specific corpus (this answers your question 2). However, the size of the improvement also depends on the domain and task you are targeting, and on how closely the data used for pre-training is related to your target dataset. A minimal fine-tuning loop could look like the sketch below.
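This is only a rough sketch using the Hugging Face `transformers` library, assuming `bert-base-uncased` and a binary classification task; the texts, labels, learning rate, and epoch count are placeholders you would replace with your own data and tuning.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load a pre-trained BERT checkpoint with a classification head on top
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Hypothetical task-specific corpus (replace with your own examples)
texts = ["an example sentence", "another example"]
labels = torch.tensor([0, 1])

# Tokenize into input IDs and attention masks
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few epochs are often enough for a small corpus
    optimizer.zero_grad()
    out = model(**enc, labels=labels)  # the model returns a loss when labels are passed
    out.loss.backward()
    optimizer.step()
```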
In my experience, pre-trained BERT fine-tuned on the target task performs much better than other DNNs such as LSTMs and CNNs when the dataset is highly imbalanced. However, this again depends on the task and the data. A 1:99 ratio is a huge imbalance, which will likely require data balancing techniques such as resampling or a class-weighted loss; one common option is sketched below.
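As a rough sketch of the class-weighting option (not the only way to handle imbalance): weight the cross-entropy loss inversely to class frequency instead of, or in addition to, resampling. The 99:1 counts below mirror the imbalance mentioned above and are placeholders; compute the weights from your own label counts. The `model`, `enc`, and `labels` names refer to the fine-tuning sketch above.

```python
import torch
from torch.nn import CrossEntropyLoss

# Hypothetical label counts for a 99:1 split (replace with your own counts)
class_counts = torch.tensor([99.0, 1.0])
weights = class_counts.sum() / (2 * class_counts)  # inverse-frequency weights
loss_fn = CrossEntropyLoss(weight=weights)

# During fine-tuning, apply the weighted loss to the raw logits
# instead of the unweighted loss the model computes internally
logits = model(**enc).logits
loss = loss_fn(logits, labels)
loss.backward()
```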