I am training a model for a sparse multi-label text classification problem using Hugging Face models; it is one part of a Smart Reply system. The task I am working on is described below:
The model takes Customer Utterances as input and classifies them into the Agent Response clusters they belong to. There are 60 clusters, and a Customer Utterance can map to one or more of them.
Input to the model:

Input:  My account is blocked
Output: [0,0,0,1,1,0....0,0,0,0,0]

The output is a multi-hot encoding vector over the cluster labels. In the example above, the customer query maps to cluster 4 and cluster 5 of the agent responses.
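For clarity, a minimal sketch of how such a multi-hot label vector can be built from the cluster ids (num_clusters = 60 as above; the 1-indexed cluster ids are taken from the example):

import torch

num_clusters = 60
cluster_ids = [4, 5]                          #1-indexed cluster ids from the example above
labels = torch.zeros(num_clusters)
labels[[i - 1 for i in cluster_ids]] = 1.0    #multi-hot target vector fed to the model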
Problem:
The model always predicts the clusters that are very frequent and never picks up the rare clusters. Only a few 1's are present in the output labels at a time; the rest are 0.
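To quantify the imbalance, one can look at how many positives each cluster has in the training labels (a minimal sketch; train_labels as an N x 60 multi-hot tensor is an assumption, not a variable from my code):

#train_labels: (num_examples, 60) multi-hot tensor -- name and shape are assumptions
pos_counts = train_labels.sum(dim=0)          #number of positive examples per cluster
print(pos_counts)                             #a handful of clusters dominate
print(pos_counts / train_labels.shape[0])     #fraction of examples that are positive per cluster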
Code:
#BertAdam is assumed to come from the older pytorch_pretrained_bert package
from pytorch_pretrained_bert import BertAdam

#Divide the parameters into those that receive weight decay and those that do not
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
        'weight_decay_rate': 0.01
    },
    {
        'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
        'weight_decay_rate': 0.0
    }
]
optimizer = BertAdam(optimizer_grouped_parameters, lr = 0.05, warmup = .1)
Model Training
import torch
from torch.nn import BCEWithLogitsLoss
from tqdm import tqdm, trange

#Empty the GPU cache, as training is memory and CPU intensive
torch.cuda.empty_cache()

#Number of times the whole dataset will run through the network while the model is fine-tuned
epochs = 10
epoch_count = 1

#Iterate over the number of epochs
for _ in trange(epochs, desc = "Epoch"):
    #Switch the model to train mode so that gradients are updated
    model.train()

    #Initiate train and validation loss, number of rows passed and number of batches passed
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    val_loss = 0
    nb_val_examples, nb_val_steps = 0, 0

    #Iterate over batches within the same epoch
    for batch in tqdm(train_dataloader):
        #Shift the batch to the GPU for computation
        #pdb.set_trace()
        batch = tuple(t.to(device) for t in batch)

        #Load the input ids, masks and labels from the batch
        b_input_ids, b_input_mask, b_labels = batch

        #Reset gradients to 0 as they accumulate by default
        optimizer.zero_grad()

        #Forward pass of the input data
        logits = model(b_input_ids, token_type_ids = None, attention_mask = b_input_mask)

        #Binary cross-entropy loss with a sigmoid applied internally (BCEWithLogitsLoss)
        loss_func = BCEWithLogitsLoss()

        #Calculate the loss between the multi-label predicted outputs and the actuals
        loss = loss_func(logits, b_labels.type_as(logits))

        #Backpropagate the loss and calculate the gradients
        loss.backward()

        #Update the weights with the calculated gradients
        optimizer.step()

        #Add the batch loss to the running loss, and update row and batch counts
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    #Print the current training loss
    print("Train Loss: {}".format(tr_loss/nb_tr_examples))

    # Save the trained model after each epoch.
    # pickle.dump(model, open("conv_bert_model_"+str(epoch_count)+".pkl", "wb"))
    epoch_count = epoch_count + 1
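For measuring precision and recall, the model's logits are converted into multi-label predictions; a minimal sketch of that step (the fixed 0.5 sigmoid threshold is an assumption, and rare clusters rarely clear it):

with torch.no_grad():
    probs = torch.sigmoid(logits)      #per-cluster probabilities for a batch
    preds = (probs > 0.5).long()       #multi-hot predictions; frequent clusters dominate these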
I am using this loss function currently:
loss_func = BCEWithLogitsLoss()
#Calculate the loss between multilabel predicted outputs and actuals
loss = loss_func(logits, b_labels.type_as(logits))
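One option I am considering is a per-class pos_weight in BCEWithLogitsLoss to up-weight the rare clusters (a minimal sketch; train_labels as an N x 60 multi-hot tensor is an assumption, as above):

from torch.nn import BCEWithLogitsLoss

pos_counts = train_labels.sum(dim=0)                   #positives per cluster
neg_counts = train_labels.shape[0] - pos_counts        #negatives per cluster
pos_weight = neg_counts / pos_counts.clamp(min=1)      #larger weight for rarer clusters

loss_func = BCEWithLogitsLoss(pos_weight = pos_weight.to(device))
loss = loss_func(logits, b_labels.type_as(logits))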
Is there any way to improve the model's output (recall and precision) by using a different loss function?
How can we tackle the cluster imbalance problem in Hugging Face models in the case of multi-label classification?