I would like to load a pre-trained BERT model and fine-tune it, in particular its word embeddings, on a custom dataset. The goal is to use the word embeddings of chosen words for further analysis. It is important to mention that the dataset consists of tweets and has no labels, so I used the BertForMaskedLM model.
Is it OK for this task to pass the input IDs (the tokenized tweets) directly as the labels? I have no labels, just tweets in randomized order.
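For comparison, my understanding is that the standard masked-LM setup masks a random subset of tokens and computes the loss only on those masked positions, e.g. with DataCollatorForLanguageModeling. A minimal sketch of what I think that would look like (the example tweets are placeholders and mlm_probability=0.15 is just the library default):

import torch
from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Randomly masks ~15% of tokens; the labels keep the original ids at the
# masked positions and are -100 elsewhere, so the loss ignores unmasked tokens.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(["example tweet one", "another example tweet"],
                    truncation=True, padding=True)
batch = data_collator([torch.tensor(ids) for ids in encoded['input_ids']])
# batch['input_ids'] now contains [MASK] tokens, batch['labels'] the targets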
From this point, I present the code I wrote:
First, I cleaned the dataset of emojis, non-ASCII characters, etc., as described in Section 2.3 of the following notebook: https://www.kaggle.com/jaskaransingh/bert-fine-tuning-with-pytorch
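Roughly, the cleaning looks like the following (a sketch only; the raw-text column name content is an assumption, and the exact regular expressions just follow the linked notebook's idea of stripping URLs, mentions and non-ASCII characters such as emojis):

import re
import pandas as pd

def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)              # remove URLs
    text = re.sub(r'@\w+', '', text)                 # remove mentions
    text = text.encode('ascii', 'ignore').decode()   # drop emojis / non-ASCII
    return re.sub(r'\s+', ' ', text).strip()         # collapse whitespace

df = pd.read_csv(path + file_name)
df['content_cleaned'] = df['content'].apply(clean_tweet)  # 'content' = raw tweet text (assumed column name)
df.to_csv(path + file_name, index=False)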
Second, the code for the fine-tuning process:
import torch
import pandas as pd
from transformers import BertTokenizer, BertForMaskedLM, AdamW

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.to(device)
model.train()

lr = 1e-2
max_grad_norm = 1.0  # gradient clipping threshold
optimizer = AdamW(model.parameters(), lr=lr, correct_bias=False)

max_len = 82
chunk_size = 20
epochs = 20
for epoch in range(epochs):
    epoch_losses = []
    for j, batch in enumerate(pd.read_csv(path + file_name, chunksize=chunk_size)):
        tweets = batch['content_cleaned'].tolist()
        encoded_dict = tokenizer.batch_encode_plus(
            tweets,                       # Tweets to encode.
            add_special_tokens=True,      # Add '[CLS]' and '[SEP]'.
            max_length=max_len,           # Pad & truncate all sentences.
            pad_to_max_length=True,
            truncation=True,
            return_attention_mask=True,   # Construct attention masks.
            return_tensors='pt',          # Return PyTorch tensors.
        )
        input_ids = encoded_dict['input_ids'].to(device)
        attention_mask = encoded_dict['attention_mask'].to(device)

        # Is it correct? Or should I train it in another way?
        loss, _ = model(input_ids, attention_mask=attention_mask, labels=input_ids)

        loss_score = loss.item()
        epoch_losses.append(loss_score)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained(path + "Fine_Tuned_BertForMaskedLM")
The loss starts at around 50 and decreases to about 2.3 over the course of training.
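After fine-tuning, I plan to extract the embeddings of the chosen words roughly as follows (just a sketch of my plan; chosen_words is a placeholder for my actual word list, and I read the non-contextual input embedding matrix rather than contextual hidden states):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained(path + "Fine_Tuned_BertForMaskedLM")
model.eval()

# Fine-tuned input (word-piece) embedding matrix: [vocab_size, hidden_size]
embedding_matrix = model.bert.embeddings.word_embeddings.weight

chosen_words = ['happy', 'sad']  # placeholder for the words I want to analyse
for word in chosen_words:
    # Note: words that split into several word pieces would need extra handling.
    token_id = tokenizer.convert_tokens_to_ids(word)
    vector = embedding_matrix[token_id].detach().cpu()
    print(word, vector.shape)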