
I am training a sparse multi-label text classification model using Hugging Face models; it is one part of a SMART REPLY system. The task I am working on is described below:

I feed Customer Utterances into the model and classify which Agent Response clusters they belong to. I have 60 clusters, and a Customer Utterance can map to one or more clusters.

Input to Model

Input                             Output

My account is blocked             [0,0,0,1,1,0....0,0,0,0,0]

The output is the encoding vector for the cluster labels. In the above example the customer query maps to cluster 4 and cluster 5 of the agent responses.
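
For context, a multi-hot label vector like the one above can be built with something like the sketch below (the cluster ids and the MultiLabelBinarizer usage are illustrative, not taken from my actual pipeline):

from sklearn.preprocessing import MultiLabelBinarizer

#Illustrative only: each utterance is tagged with the ids of the agent response
#clusters it belongs to (60 clusters, ids 0..59)
utterance_clusters = [
    [3, 4],        # "My account is blocked" -> 4th and 5th cluster (0-indexed ids 3 and 4)
    [0],
    [7, 12, 40],
]
mlb = MultiLabelBinarizer(classes=list(range(60)))
labels = mlb.fit_transform(utterance_clusters)   # shape (num_utterances, 60)
print(labels[0])                                 # [0 0 0 1 1 0 ... 0]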

Problem:

The model always predicts the clusters that occur very frequently; it never predicts the rare clusters.

Only a few 1's are present in the output labels at a time; the rest are 0.
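
A quick check that confirms the imbalance is to count how often each cluster is 1 in the training labels (a sketch; train_labels is assumed to be the stacked multi-hot label tensor of shape (num_examples, 60)):

#train_labels: (num_examples, 60) multi-hot tensor (assumed name)
pos_counts = train_labels.sum(dim=0)              #number of positives per cluster
pos_rate = pos_counts / train_labels.shape[0]     #fraction of examples where the cluster is 1
sorted_rates, sorted_ids = pos_rate.sort(descending=True)
print(sorted_ids[:5], sorted_rates[:5])           #most frequent clusters
print(sorted_ids[-5:], sorted_rates[-5:])         #rarest clusters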

Code:

#Divide the params into those that get weight decay and those that don't (biases and LayerNorm gamma/beta)
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
        'weight_decay_rate': 0.01
    },
    {
        'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
        'weight_decay_rate': 0.0
    }
]

optimizer = BertAdam(optimizer_grouped_parameters, lr =0.05, warmup = .1)
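
Side note: BertAdam comes from the older pytorch_pretrained_bert package, and lr = 0.05 is far above the 1e-5 to 5e-5 range usually used when fine-tuning BERT. Below is a sketch of an equivalent setup with the current PyTorch/transformers APIs, reusing param_optimizer and no_decay from above (the learning rate and warmup fraction are assumed values, not tuned):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

#torch's AdamW expects the key 'weight_decay', not 'weight_decay_rate'
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
num_training_steps = len(train_dataloader) * epochs      #assumed: defined before training
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),      #roughly equivalent to warmup=.1
    num_training_steps=num_training_steps)
#scheduler.step() would then be called right after optimizer.step() in the batch loop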

Model Training

#Empty the GPU cache, since training is memory intensive
torch.cuda.empty_cache()
#Number of times the whole dataset will run through the network and model is fine-tuned
epochs = 10
epoch_count = 1
#Iterate over number of epochs
for _ in trange(epochs, desc = "Epoch"):
    #Switch model to train phase where it will update gradients
    model.train()
    #Initiate train and validation loss, number of rows passed and number of batches passed
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    val_loss = 0
    nb_val_examples, nb_val_steps = 0, 0
   
    #Iterate over batches within the same epoch
    for batch in tqdm(train_dataloader):
        #Shift the batch to GPU for computation
        #pdb.set_trace()
        batch = tuple(t.to(device) for t in batch)
        #Load the input ids and masks from the batch
        b_input_ids, b_input_mask, b_labels = batch
        #Initiate gradients to 0 as they tend to add up
        optimizer.zero_grad()
        #Forward pass the input data
        logits = model(b_input_ids, token_type_ids = None, attention_mask = b_input_mask)
        #Use binary cross entropy loss; BCEWithLogitsLoss applies the sigmoid internally
        loss_func = BCEWithLogitsLoss()
        #Calculate the loss between multilabel predicted outputs and actuals
        loss = loss_func(logits, b_labels.type_as(logits))
        
        #Backpropagate the loss and calculate the gradients
        loss.backward()
        #Update the weights with the calculated gradients
        optimizer.step()
        #Add the loss of the batch to the final loss, number of rows and batches
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
    #Print the average training loss per batch for this epoch
    print("Train Loss: {}".format(tr_loss/nb_tr_steps))
    
    # Save the trained model after each epoch.
#     pickle.dump(model, open("conv_bert_model_"+str(epoch_count)+".pkl", "wb"))
    epoch_count=epoch_count+1
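
A note on the commented-out pickle line: a more conventional way to checkpoint a model each epoch is to save the state dict, or save_pretrained for Hugging Face models (the paths below are placeholders):

#Save a checkpoint after each epoch (placeholder paths)
torch.save(model.state_dict(), "conv_bert_model_" + str(epoch_count) + ".pt")
#or, if the model is a Hugging Face PreTrainedModel:
#model.save_pretrained("conv_bert_model_" + str(epoch_count))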

I am using this loss function currently:

loss_func = BCEWithLogitsLoss()
#Calculate the loss between multilabel predicted outputs and actuals
loss = loss_func(logits, b_labels.type_as(logits))

Is there any way to improve the model's output (recall and precision) by using a different loss function?

How do we tackle the cluster imbalance problem in Hugging Face models in the case of multi-label classification?
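
To see where precision and recall break down per cluster, a per-label report helps (a sketch; val_logits and val_labels are assumed to be collected over a validation set, and 0.5 is just a default threshold):

import torch
from sklearn.metrics import classification_report

#Assumed: val_logits and val_labels are (num_examples, 60) tensors from a validation pass
val_preds = (torch.sigmoid(val_logits) > 0.5).int().cpu().numpy()
print(classification_report(val_labels.cpu().numpy(), val_preds, zero_division=0))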


1 Answer


You can use a weighted binary cross entropy applied to each index of the output. You'd have to go through the training set to calculate the weights for each cluster.

criterion = nn.BCEWithLogitsLoss(reduction='none')
loss = criterion(output, target)
loss = (loss * weights).mean()
loss.backward()

By doing so the losses for different indexes are not combined immediately, but kept separate. They are first multiplied with the weights and then combined.

To calculate the weights, assuming outputs is a tensor of the multi-hot training labels:

weights = torch.sum(outputs, 0)/torch.sum(outputs)

And assuming numpy arrays:

weights = np.sum(outputs, 0)/np.sum(outputs)
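
Putting the pieces together, a sketch of how this could look with the training loop from the question (train_labels is assumed to be the stacked multi-hot label matrix of the full training set; the commented pos_weight lines show BCEWithLogitsLoss's built-in alternative, which up-weights the positive term of rare clusters):

from torch.nn import BCEWithLogitsLoss

#Assumed: train_labels is a (num_examples, 60) multi-hot tensor of the training labels
weights = train_labels.sum(dim=0) / train_labels.sum()   #per-cluster weight, as described above
weights = weights.to(device)

criterion = BCEWithLogitsLoss(reduction='none')
#Built-in alternative that boosts rare clusters instead:
#pos_weight = (train_labels.shape[0] - train_labels.sum(dim=0)) / train_labels.sum(dim=0)
#criterion = BCEWithLogitsLoss(pos_weight=pos_weight.to(device))

#Inside the batch loop, replacing the original loss computation:
loss = criterion(logits, b_labels.type_as(logits))        #shape (batch_size, 60)
loss = (loss * weights).mean()
loss.backward()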
  • You mean to say for each cluster we should assign a weight, something like for `cluster 1` the weight can be `(1/n)` where n is the total number of data points in `cluster 1`? – MAC Sep 08 '21 at 13:45
  • In your above solution `(loss * weights)`, are `loss` and `weights` of the same dimension as the one-hot encoded labels? – MAC Sep 08 '21 at 13:48
  • The labels do not need to be one-hot, they can be multi-hot. Yes, the weights are calculated by summing and dividing by n. – Kroshtan Sep 08 '21 at 13:50
  • @MAC the weights need to be calculated by summing the occurrence of each class and dividing by the total number of occurrences. I'll add this to the answer. – Kroshtan Sep 10 '21 at 05:33
  • @MAC If the above answer solved your problem, please accept it. If not, please provide additional information so that a solution can be found. – Kroshtan Sep 13 '21 at 13:04
  • It does not. I will update with the problems I ran into using this approach. – MAC Sep 13 '21 at 13:14