
As far as I understand, the RoBERTa model implemented by the Hugging Face library uses a BPE tokenizer. Here is what the documentation says:

> RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.

However, I have a custom tokenizer based on WordPiece tokenization, so I used BertTokenizer.

Because my custom tokenizer is much better suited to my task, I prefer not to use BPE.

When I pre-trained RoBERTa from scratch (RobertaForMaskedLM) with my custom tokenizer, the loss on the MLM task was much better than the loss with BPE. However, when it comes to fine-tuning, the model (RobertaForSequenceClassification) performs poorly. I am almost sure the problem is not the tokenizer itself; I wonder whether RobertaForSequenceClassification in the Hugging Face library is simply not compatible with my tokenizer.
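For context, the pre-training setup looks roughly like the sketch below (simplified; the tokenizer path is a placeholder, and the config values are assumptions about what has to stay in sync between the WordPiece tokenizer and RobertaConfig, not my exact code):

```python
# Minimal sketch: pre-training RoBERTa from scratch with a custom WordPiece
# tokenizer loaded through BertTokenizer. The path is a placeholder. The key
# point is keeping vocab_size and the special-token ids of RobertaConfig in
# sync with the custom tokenizer, since RoBERTa's defaults (pad=1, bos=0,
# eos=2) differ from a BERT-style vocab.
from transformers import (
    BertTokenizer,
    RobertaConfig,
    RobertaForMaskedLM,
    DataCollatorForLanguageModeling,
)

tokenizer = BertTokenizer.from_pretrained("./my-wordpiece-tokenizer")  # hypothetical path

config = RobertaConfig(
    vocab_size=len(tokenizer),            # match the WordPiece vocab, not RoBERTa's default 50265
    pad_token_id=tokenizer.pad_token_id,  # [PAD] is usually 0 in a WordPiece vocab
    bos_token_id=tokenizer.cls_token_id,  # map [CLS]/[SEP] onto RoBERTa's <s>/</s> slots
    eos_token_id=tokenizer.sep_token_id,
    max_position_embeddings=514,          # RoBERTa reserves two extra position slots
)

model = RobertaForMaskedLM(config)

# Standard MLM collator; masks 15% of tokens using the tokenizer's [MASK] id.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```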

Details about the fine-tuning:

- task: multilabel classification with imbalanced labels
- epochs: 20
- loss: BCEWithLogitsLoss()
- optimizer: Adam, weight_decay_rate: 0.01, lr: 2e-5, correct_bias: True

The F1 and AUC were very low because the output probabilities for the labels did not agree with the actual labels (even with a very low threshold), which suggests the model did not learn anything.
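The fine-tuning loop is roughly the following sketch (simplified; the checkpoint path, number of labels, and batch contents are placeholders rather than my exact code):

```python
# Minimal sketch of the fine-tuning setup described above. Since the default
# RobertaForSequenceClassification head assumes single-label cross-entropy,
# BCEWithLogitsLoss is applied manually to the raw logits for the multilabel case.
import torch
from transformers import BertTokenizer, RobertaForSequenceClassification, AdamW

NUM_LABELS = 10  # placeholder: number of labels in the multilabel task

tokenizer = BertTokenizer.from_pretrained("./my-wordpiece-tokenizer")  # hypothetical path
model = RobertaForSequenceClassification.from_pretrained(
    "./my-pretrained-roberta-mlm",  # hypothetical MLM checkpoint directory
    num_labels=NUM_LABELS,
)

# transformers' AdamW mirrors the listed hyperparameters (incl. correct_bias);
# newer versions may prefer torch.optim.AdamW instead.
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01, correct_bias=True)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
batch = tokenizer(["example document"], padding=True, truncation=True, return_tensors="pt")
labels = torch.zeros((1, NUM_LABELS))  # multi-hot targets, placeholder values

outputs = model(**batch, return_dict=True)  # labels are NOT passed to the model
loss = loss_fn(outputs.logits, labels)      # BCE with logits on the raw scores
loss.backward()
optimizer.step()
optimizer.zero_grad()
```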


Note: the RoBERTa pre-trained and fine-tuned with the BPE tokenizer performs better than the one pre-trained and fine-tuned with the custom tokenizer, even though the MLM loss with the custom tokenizer was better than with BPE.

Adel
  • Does that mean you have trained a whole RoBERTa by yourself without using any pretrained weights like `roberta-base`? – cronoik Dec 11 '20 at 09:50
  • I also do not think that this is caused by the tokenizer. It would be great if you could add some information about the fine-tuning (learning rate, epochs, training size, accuracy, ...). Please add this information directly to your question. I have already removed some of my comments that are no longer relevant. – cronoik Dec 11 '20 at 10:02
  • Furthermore, it could be interesting to know the size of the data you used to train your model from scratch. – cronoik Dec 11 '20 at 10:07
  • Also, can you please share part of the code you used for training and fine-tuning? (Just to ensure that you are fine-tuning the appropriate model.) – Ashwin Geet D'Sa Dec 11 '20 at 16:26

0 Answers