3

I'm trying to use hugging face's BERT-base-uncased model to train on emoji prediction on tweets, and it seems that after the first epoch, the model immediately starts to overfit. I have tried the following:

  1. Increasing the training data (I increased this from 1x to 10x with no effect)
  2. Changing the learning rate (no differences there)
  3. Using different models from hugging face (the results were the same again)
  4. Changing the batch size (went from 32, 72, 128, 256, 512, 1024)
  5. Creating a model from scratch, but I ran into issues and decided to post here first to see if I was missing anything obvious.

At this point, I'm concerned that the individual tweets don't give enough information for the model to make a good guess, but wouldn't it be random in that case, rather than overfitting?

Also, training time seems to be ~4.5 hours on Colab's free GPUs, is there any way to speed that up? I tried their TPU, but it doesn't seem to be recognized.

This is what the data looks like

dataset screenshot

And this is my code below:

import pandas as pd
import json
import re
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.model_selection import train_test_split
import torch
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback
from sklearn.metrics import accuracy_score,precision_score, recall_score, f1_score
import numpy as np

# opening up the data and removing all symbols
df = pd.read_json('/content/drive/MyDrive/computed_results.json.bz2')
df['text_no_emoji'] = df['text_no_emoji'].apply(lambda text: re.sub(r'[^\w\s]', '', text))


# loading the tokenizer and the model from huggingface
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=5).to('cuda')

# test train split
train, test = train_test_split(df[['text_no_emoji', 'emoji_codes']].sample(frac=1), test_size=0.2)

# defining a dataset class that generates the encoder and labels on the fly to minimize memory usage
class Dataset(torch.utils.data.Dataset):    
    def __init__(self, input, labels=None):
        self.input = input
        self.labels = labels

    def __getitem__(self, pos):
        encoded = tokenizer(self.input[pos], truncation=True, max_length=15, padding='max_length')
        label = self.labels[pos]
        ret = {key: torch.tensor(val) for key, val in encoded.items()}

        ret['labels'] = torch.tensor(label)
        return ret

    def __len__(self):
        return len(self.labels)

# training and validation datasets are defined here
train_dataset = Dataset(train['text_no_emoji'].tolist(), train['emoji_codes'].tolist())
val_dataset = Dataset(train['text_no_emoji'].tolist(), test['emoji_codes'].tolist())

# defining the training arguments
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="epoch",
    logging_steps = 10,
    per_device_train_batch_size=1024,
    per_device_eval_batch_size=1024,
    num_train_epochs=5,
    save_steps=3000,
    seed=0,
    load_best_model_at_end=True,
    weight_decay=0.2,
)

# defining the model trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Training the model
trainer.train()

Results: After this, the training generally stops pretty fast due to the early stopper

The dataset can be found here (39 Mb compressed)

Results from 3 epochs

Ali Abbas
  • 136
  • 1
  • 8
  • Which learning rates have you used? How do you know that it overfitted? – cronoik Jun 15 '21 at 23:48
  • For the learning rates, I used the default, then either increased or decreased it by an order of magnitude. As for overfitting, I've edited the question to show you the results (I was waiting for the latest version to be crunched) – Ali Abbas Jun 15 '21 at 23:50
  • Is that dataset publicly avaiable? – cronoik Jun 16 '21 at 00:02
  • Yep. It's from the internet archive, with a ton of post processing to make things smoother. Checkout the Social Media Public Analysis on github, where we're working on doing amazing things with years of Twitter data! – Ali Abbas Jun 16 '21 at 00:04
  • I can also upload a sample of the data somewhere if you want – Ali Abbas Jun 16 '21 at 00:15
  • Yes please upload a sample and also run your code with the sample again with the sample and add the losses to your question. Might take some time until I can have a look. – cronoik Jun 16 '21 at 00:40
  • Hey, sorry for the delay. It was 4 am and I had to sleep. I added the losses, and I added the dataset that I was using (the full one, but taking a sample should be easy, since it's relatively small) – Ali Abbas Jun 16 '21 at 08:38
  • Your training is significantly influenced with the weigh decay (regularization). Try removing it (atleast to begin with), – Ashwin Geet D'Sa Jun 17 '21 at 12:23
  • I did some more research and found that it's a common issue with hugging face unless you have a large amount of data, and I think that's increased further if your dataset is limited in data (like with tweets). I'm going to keep this open until I have a solution, but if @cronoik gets something (they're my only hope!), or Ashwin's solution works – Ali Abbas Jun 17 '21 at 19:28

0 Answers0