I would like to load a pre-trained BERT model and fine-tune it, in particular its word embeddings, on a custom dataset. The goal is to use the word embeddings of chosen words for further analysis. It is important to mention that the dataset consists of tweets and has no labels, so I used the BertForMaskedLM model.
Is it OK for this task to pass the input IDs (the tokenized tweets) directly as the labels? I have no labels, just tweets in randomized order.
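For comparison, my understanding is that the standard masked-LM setup masks a random subset of tokens and computes the loss only on those masked positions, e.g. with DataCollatorForLanguageModeling. A minimal sketch of what I think that would look like (the example tweets are placeholders and mlm_probability=0.15 is just the library default):

import torch
from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Randomly masks ~15% of tokens; the labels keep the original ids at the
# masked positions and are -100 elsewhere, so the loss ignores unmasked tokens.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(["example tweet one", "another example tweet"],
                    truncation=True, padding=True)
batch = data_collator([torch.tensor(ids) for ids in encoded['input_ids']])
# batch['input_ids'] now contains [MASK] tokens, batch['labels'] the targets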
From this point, I present the code I wrote:
First, I cleaned the dataset of emojis, non-ASCII characters, etc., as described in Section 2.3 of the following notebook: https://www.kaggle.com/jaskaransingh/bert-fine-tuning-with-pytorch
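Roughly, the cleaning looks like the following (a sketch only; the raw-text column name content is an assumption, and the exact regular expressions just follow the linked notebook's idea of stripping URLs, mentions and non-ASCII characters such as emojis):

import re
import pandas as pd

def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)              # remove URLs
    text = re.sub(r'@\w+', '', text)                 # remove mentions
    text = text.encode('ascii', 'ignore').decode()   # drop emojis / non-ASCII
    return re.sub(r'\s+', ' ', text).strip()         # collapse whitespace

df = pd.read_csv(path + file_name)
df['content_cleaned'] = df['content'].apply(clean_tweet)  # 'content' = raw tweet text (assumed column name)
df.to_csv(path + file_name, index=False)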
Second, the code for the fine-tuning process:
import torch
import pandas as pd
from transformers import BertTokenizer, BertForMaskedLM, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.to(device)
model.train()

lr = 1e-2
max_grad_norm = 1.0  # gradient clipping threshold
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)

max_len = 82
chunk_size = 20
epochs = 20
for epoch in range(epochs):
    epoch_losses = []
    for j, batch in enumerate(pd.read_csv(path + file_name, chunksize=chunk_size)):
        tweets = batch['content_cleaned'].tolist()
        encoded_dict = tokenizer.batch_encode_plus(
            tweets,                       # Tweets to encode.
            add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
            max_length=max_len,           # Pad & truncate all sentences.
            pad_to_max_length=True,
            truncation=True,
            return_attention_mask=True,   # Construct attention masks.
            return_tensors='pt',          # Return PyTorch tensors.
        )
        input_ids = encoded_dict['input_ids'].to(device)
        attention_mask = encoded_dict['attention_mask'].to(device)

        # Is it correct? Or should I train it in another way?
        loss, _ = model(input_ids, attention_mask=attention_mask, labels=input_ids)

        loss_score = loss.item()
        epoch_losses.append(loss_score)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained(path + "Fine_Tuned_BertForMaskedLM")
The loss starts at around 50 and decreases to about 2.3 over the course of training.
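After fine-tuning, I plan to extract the embeddings of the chosen words roughly as follows (just a sketch of my plan; chosen_words is a placeholder for my actual word list, and I read the non-contextual input embedding matrix rather than contextual hidden states):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained(path + "Fine_Tuned_BertForMaskedLM")
model.eval()

# Fine-tuned input (word-piece) embedding matrix: [vocab_size, hidden_size]
embedding_matrix = model.bert.embeddings.word_embeddings.weight

chosen_words = ['happy', 'sad']  # placeholder for the words I want to analyse
for word in chosen_words:
    # Note: words that split into several word pieces would need extra handling.
    token_id = tokenizer.convert_tokens_to_ids(word)
    vector = embedding_matrix[token_id].detach().cpu()
    print(word, vector.shape)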