
I have an NLP/text classification problem with a very skewed class distribution: class 0 is 98% and class 1 is 2%. For my training and validation data I oversample, which gives a distribution of roughly 55% class 0 and 45% class 1. The test data keeps the original skewed distribution.

I built a model using nn.BCEWithLogitsLoss(pos_weight=tensor(1.2579, device='cuda:0')). pos_weight was calculated as 55/45 (the class distribution in the training data).
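(For reference, a minimal sketch of that pos_weight computation; the counts below are illustrative placeholders rather than my real data.)

import torch
import torch.nn as nn

# illustrative counts for an oversampled training set with a ~55/45 split
n_negative, n_positive = 55_000, 45_000

# pos_weight = (# negative examples) / (# positive examples)
pos_weight = torch.tensor([n_negative / n_positive],
                          device='cuda:0' if torch.cuda.is_available() else 'cpu')
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)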

On class 1 of the test data I got an F1 score of 0.07, with (true negatives, false positives, false negatives, true positives) = (28809, 13258, 537, 495).

I then changed to focal loss using the code below, and performance didn't improve much. F1 on class 1 of the test data is still about the same, with (true negatives, false positives, false negatives, true positives) = (32527, 9540, 640, 392).
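(As a sanity check, the class 1 F1 can be recomputed directly from each confusion matrix; both runs land at roughly 0.07.)

# F1 for the positive class from (tn, fp, fn, tp): f1 = 2*tp / (2*tp + fp + fn)
for tn, fp, fn, tp in [(28809, 13258, 537, 495),   # BCEWithLogitsLoss run
                       (32527, 9540, 640, 392)]:   # focal-loss run
    print(round(2 * tp / (2 * tp + fp + fn), 3))   # 0.067 and 0.072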

kornia.losses.binary_focal_loss_with_logits(probssss, labelsss, alpha=0.25, gamma=2.0, reduction='mean')

  1. Are my alpha and gamma parameters wrong? Are there specific values I should try? I could tune them, but that might take a lot of time and resources, so I am looking for recommendations (see the grid-search sketch after this list).
  2. For nn.BCEWithLogitsLoss(pos_weight=tensor(1.2579, device='cuda:0')), should I use a different value for pos_weight? Please keep in mind that my goal is to maximize F1 on class 1 of the test data.
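(Below is a minimal sketch of the kind of small grid search I have in mind for alpha and gamma, selecting by validation F1 on class 1. The helper train_and_validate is a hypothetical placeholder that trains the model with kornia's focal loss for the given parameters and returns validation logits and labels.)

import itertools
import torch
from sklearn.metrics import f1_score

# candidate values around the focal-loss paper defaults (alpha=0.25, gamma=2.0)
best = None
for alpha, gamma in itertools.product([0.25, 0.5, 0.75], [1.0, 2.0, 3.0, 5.0]):
    val_logits, val_labels = train_and_validate(alpha, gamma)   # hypothetical helper
    preds = (torch.sigmoid(val_logits) > 0.5).long().cpu().numpy()
    score = f1_score(val_labels.cpu().numpy(), preds, pos_label=1)
    if best is None or score > best[0]:
        best = (score, alpha, gamma)
print(best)   # (best validation F1, alpha, gamma)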

Update:

I am building a CNN with GloVe embeddings: I take my text and look up the GloVe embedding for each token. I remove all punctuation, but apart from that there is no major data cleaning. I am mainly interested in tuning the focal-loss parameters alpha and gamma.
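(A minimal sketch of that preprocessing, assuming a local GloVe text file and a word2idx vocabulary with '<unk>' and '<pad>' entries; the actual pipeline may differ.)

import string
import numpy as np

def load_glove(path='glove.6B.300d.txt'):   # illustrative path
    """Return a {word: vector} dict from a GloVe text file."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, *values = line.rstrip().split(' ')
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def encode(text, word2idx, max_len=100):
    """Strip punctuation, lower-case, and map tokens to embedding indices."""
    text = text.translate(str.maketrans('', '', string.punctuation)).lower()
    ids = [word2idx.get(tok, word2idx['<unk>']) for tok in text.split()]
    return (ids + [word2idx['<pad>']] * max_len)[:max_len]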

My model is as below

import numpy as np
import torch
import torch.nn as nn


class CNN(nn.Module):
    
    def __init__(self,
                 pretrained_embedding,
                 embed_dim,
                 filter_sizes,
                 num_filters,
                 fc1_neurons,
                 fc2_neurons,
                 dropout):

        super(CNN, self).__init__()
        
        # Embedding layer
        self.vocab_size, self.embed_dim = pretrained_embedding.shape
        self.embedding = nn.Embedding.from_pretrained(pretrained_embedding,
                                                      freeze=True)

        # Conv Network
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        
        #Batchnorm
        self.batch_norm1 = nn.BatchNorm1d(num_filters[0] * len(filter_sizes))
        
        # Dropout Layer
        self.dropout = nn.Dropout(p=dropout)
        
        # RELU activation function
        self.relu =  nn.ReLU()
        
        # Fully-connected layers
#         self.fc1 = nn.Linear(np.sum(num_filters), fc1_neurons)
        
        self.batch_norm2 = nn.BatchNorm1d(int(np.sum(num_filters)))  # BatchNorm1d needs an int, not a list
        
        self.fc2 = nn.Linear(np.sum(num_filters), fc2_neurons)
        
        self.batch_norm3 = nn.BatchNorm1d(fc2_neurons)
        
        self.fc3 = nn.Linear(fc2_neurons, 1)
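
The snippet stops before the forward pass. A plausible forward for this architecture (per-filter convolution, ReLU, global max pooling over time, then the fully connected head; fc1/batch_norm2 are skipped since fc1 is commented out; not necessarily the exact original) could look like:

    def forward(self, input_ids):
        # (batch, seq_len) -> (batch, embed_dim, seq_len) for Conv1d
        x = self.embedding(input_ids).float().permute(0, 2, 1)
        # each conv + ReLU, then global max pool over the time dimension
        pooled = [torch.max(self.relu(conv(x)), dim=2)[0] for conv in self.conv1d_list]
        x = torch.cat(pooled, dim=1)            # (batch, sum(num_filters))
        x = self.dropout(self.batch_norm1(x))
        x = self.batch_norm3(self.relu(self.fc2(x)))
        return self.fc3(x)                      # raw logits for BCEWithLogitsLoss / focal loss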
user2543622

3 Answers


I think more important than these parameter values is how the features are being learned. Do you train an NLP model from scratch, i.e. only on your own texts, or do you start from a pretrained model? I would suggest the latter given your sample sizes.

bert wassink

I think you should try an LSTM, GRU, or Transformer-based approach. I would recommend transformer models such as BERT, DistilBERT, RoBERTa, etc. You can train from scratch or fine-tune a pretrained model; it should give you a better F1 score than a CNN-based approach.

Also, you can try adding class weights; it might help improve the accuracy.
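For illustration, a rough sketch of that idea: fine-tuning a pretrained DistilBERT with class weights in the loss. The model name and weight values are illustrative, train_loader is assumed to be your own DataLoader yielding (texts, labels) batches, and a real run also needs a scheduler and evaluation.

import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=2).to(device)

# class weights roughly inverse to the 98/2 split (illustrative values)
loss_fn = CrossEntropyLoss(weight=torch.tensor([1.0, 49.0], device=device))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

for texts, labels in train_loader:   # train_loader: your DataLoader of (texts, labels)
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=128, return_tensors='pt').to(device)
    logits = model(**enc).logits     # (batch, 2)
    loss = loss_fn(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()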

k.avinash

I encourage you to check sklearn.utils.class_weight.compute_sample_weight to compute per-sample weights and sklearn.utils.class_weight.compute_class_weight for per-class weights.

Since you have a 2% / 98% class split, focal loss is definitely a good call! I think your parameters are OK.

Next, have you tried using a sampler in your dataloader? I think torch.utils.data.WeightedRandomSampler is what you need.

Here is a little example:

import torch
from torch.utils.data import WeightedRandomSampler, DataLoader
from sklearn.utils.class_weight import compute_sample_weight

torch.manual_seed(0)

# dataset.classes holds one label per example (see below)
weights = compute_sample_weight('balanced', dataset.classes)
sampler = WeightedRandomSampler(weights, len(weights))
loader = DataLoader(dataset, sampler=sampler)

counts_train = [0, 0]
for x, y in loader:
    counts_train[int(y)] += 1

print(counts_train)   # [499, 501]

I generated a dataset of 1000 examples with your class distribution; dataset.classes is an array of size 1000 containing all the labels.

I would keep the same label distribution in all my subsets. Doing so will improve model stability.
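For example, a stratified split with scikit-learn keeps the same label distribution in every subset (texts and labels are assumed to be your full lists/arrays):

from sklearn.model_selection import train_test_split

# stratify=labels keeps the 98/2 class ratio identical in train and validation
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0)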

Don't hesitate to use data augmentation, but only on your training set. It will increase the number of examples and make the model more robust. You can check this repo.
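As one simple, library-free illustration of train-only augmentation, random word dropout (whether it actually helps depends on your data):

import random

def word_dropout(text, p=0.1, seed=None):
    """Randomly drop a fraction of tokens to create an augmented copy."""
    rng = random.Random(seed)
    kept = [tok for tok in text.split() if rng.random() > p]
    return ' '.join(kept) if kept else text

# apply only to training texts, never to validation or test data
augmented_train = [word_dropout(t, p=0.1) for t in train_texts]   # train_texts assumed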

Hope it helps!

  • What does `WeightedRandomSampler` do? Would it be possible to provide a quick summary? I googled it, but it seems I would have to spend some time to understand it. – user2543622 Apr 01 '22 at 15:27
  • It ensures that classes are returned according to the weights you provide. If you follow my code, I set it up so that all classes are returned equally; you can see that in the printed array `counts_train`: even though class 1 accounts for only 2% of the data, the loader returns roughly 50% class 0 and 50% class 1. – Marc HENRIOT Apr 01 '22 at 17:48