
This is my first question on Stack Overflow. I am working on CUAD (Contract Understanding Atticus Dataset), which is a question-answering dataset. Training on 80% of the dataset in one go is impossible for me due to resource constraints. I am using the boilerplate code provided by the HuggingFace Transformers docs for the Q&A task here. I am limited to Google Colab Pro, so I cannot use multiple GPUs for training. Despite using the hyperparameters below, I cannot avoid memory errors such as "CUDA out of memory".

from transformers import TrainingArguments

args = TrainingArguments(
    'cuad-roberta',
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=2,
    weight_decay=0.01,
    save_steps=5000,
    logging_steps=5000,
    save_total_limit=100,
    gradient_accumulation_steps=12,
    eval_accumulation_steps=4,
)
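For context, this is roughly how I plug those arguments into the Trainer from the boilerplate (a minimal sketch: `roberta-base` as the checkpoint and the `tokenized_train`/`tokenized_eval` names are placeholders for my already-preprocessed SQuAD-style features):

from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    default_data_collator,
)

# Placeholder checkpoint; any Q&A-capable model from the Hub could go here.
model_checkpoint = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)

trainer = Trainer(
    model=model,
    args=args,                      # the TrainingArguments defined above
    train_dataset=tokenized_train,  # placeholder: preprocessed training features
    eval_dataset=tokenized_eval,    # placeholder: preprocessed validation features
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
trainer.train()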

Under these circumstances, I have divided my training set (80%) into 4 parts, each holding 25% of the data. Using a Q&A-capable pretrained model from Transformers, I trained on the first 25% of the training data and saved the model to a directory on my Drive. Then I loaded the tokenizer and model from that saved directory and continued training the same model on the next 25% of the training data, as shown below.

tokenizer = AutoTokenizer.from_pretrained('/content/drive/MyDrive/models/cuad-25%-roberta-base')
model = AutoModelForQuestionAnswering.from_pretrained('/content/drive/MyDrive/models/cuad-25%-roberta-base')
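
After loading, training on that slice and saving the result looks roughly like this (again a sketch: `tokenized_chunk_2` stands for the next 25% slice and the output path `cuad-50%-roberta-base` is a placeholder):

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_chunk_2,  # placeholder: the next 25% slice, already tokenized
    eval_dataset=tokenized_eval,
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)
trainer.train()

# Save the updated weights so the following slice can continue from them
# (placeholder output directory).
trainer.save_model('/content/drive/MyDrive/models/cuad-50%-roberta-base')
tokenizer.save_pretrained('/content/drive/MyDrive/models/cuad-50%-roberta-base')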

I repeated this step two more times to finish training the model on the entire training set.

Now, my question is: is this approach correct for training a model under resource constraints? If it is, will it hurt my model's performance? I'm relatively new to ML and NLP, so please bear with any silly mistakes.

Also, any sources for understanding, visualising or implementing the Q&A task through HuggingFace Transformers would be really helpful.

Md Rakib
  • It is not like you lose 10% performance or anything like that when you do it that way, but I would consider training another epoch or two with different 25% slices of your dataset (i.e. shuffle it). Regarding the resources, just check their [example repository](https://github.com/huggingface/transformers/tree/master/examples/pytorch/question-answering) and the [documentation](https://huggingface.co/transformers/v4.5.1/custom_datasets.html#question-answering-with-squad-2-0). – cronoik May 12 '21 at 22:15

0 Answers