
I'm working on an NLP classification problem where I'm trying to classify training courses into 99 categories. I built a few models, including a Bayesian classifier, but it only reached 55% accuracy (very bad).

Given those results, I tried to fine-tune the CamemBERT model (my data is in French) to improve the results, but I had never used these methods before, so I tried to follow this example and adapt it to my code.

In the example above, there are 2 labels while I have 99 labels.

I left certain parts intact:

epochs = 5
MAX_LEN = 128
batch_size = 16
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = CamembertTokenizer.from_pretrained('camembert-base', do_lower_case=True)

I kept the same variable names: in text you have the feature column, and in labels you have the labels:

text = training['Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_x']
labels = training['Domaine sou domaine ']

I tokenized and padded the sequences using the same values as in the example, because I didn't know which values were right for my data:

# Use the tokenizer to convert sentences into token IDs
input_ids = [tokenizer.encode(sent, add_special_tokens=True, max_length=MAX_LEN) for sent in text]

# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

# Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)
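
As a side note, recent versions of the transformers tokenizer can produce the padded IDs and the attention masks in a single call, which would replace the pad_sequences step and the mask loop above. A sketch of that equivalent:

# One-step alternative: the tokenizer pads, truncates, and builds the masks itself
encoded = tokenizer(list(text), padding='max_length', truncation=True,
                    max_length=MAX_LEN, return_tensors='np')
input_ids = encoded['input_ids']
attention_masks = encoded['attention_mask']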

I noticed that the labels are numeric in the example, so I converted my labels to numeric values with this code:

label_map = {label: i for i, label in enumerate(set(labels))}
numeric_labels = [label_map[label] for label in labels]
labels = numeric_labels
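
One caveat with this mapping: set() iteration order is not stable across Python runs, so the label IDs can change from run to run. If reproducibility matters, a deterministic sketch using scikit-learn's LabelEncoder:

from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the class names, so the label -> ID mapping is
# identical on every run (unlike iterating over an unordered set)
encoder = LabelEncoder()
labels = encoder.fit_transform(training['Domaine sou domaine '])
# encoder.classes_[i] recovers the original label string for ID i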

I started building the model, beginning with the tensors:

# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    input_ids, labels, random_state=42, test_size=0.1
)

train_masks, validation_masks = train_test_split(
    attention_masks, random_state=42, test_size=0.1
)

# Convert the data to torch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

# Create data loaders
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
# Define the model architecture
model = CamembertForSequenceClassification.from_pretrained('camembert-base', num_labels=99)

# Move the model to the appropriate device
model.to(device)                                                           

The output is:

CamembertForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=99, bias=True)
  )
)

Then I proceeded with setting up the optimizer and the training loop:

param_optimizer = list(model.named_parameters())
# The key must be 'weight_decay' (not 'weight_decay_rate'), otherwise
# AdamW silently ignores it and applies no weight decay at all
optimizer_grouped_parameters = [{'params': [p for n, p in param_optimizer], 'weight_decay': 0.01}]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5, eps=1e-8)

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

train_loss_set = []

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):  
    # Tracking variables for training
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
  
    # Train the model
    model.train()
    for step, batch in enumerate(train_dataloader):
        # Add batch to device CPU or GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        # Get loss value
        loss = outputs[0]
        # Add it to train loss list
        train_loss_set.append(loss.item())    
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()
    
        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    print("Train loss: {}".format(tr_loss / nb_tr_steps))

    # Tracking variables for validation
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    # Validation of the model
    model.eval()
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Add batch to device CPU or GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
            loss, logits = outputs[:2]
    
        # Move logits and labels to CPU if GPU is used
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
    
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))

And the code ran, but the accuracy was only 30%, which is much worse than a Bayesian classifier that uses a very simple algorithm and straightforward calculations. This made me realize that I must have fine-tuned the model incorrectly, but I don't understand fine-tuning well enough to know where I went wrong.

  • Is your task simpler, with patterns in your data that are easier to capture? If yes, simpler models like a Bayesian classifier might perform better. Or is your data highly imbalanced, i.e., some classes have a lot more samples than others? This could cause the model to perform poorly on the underrepresented classes. – VonC Jul 08 '23 at 18:26
  • @VonC Yes, some classes have a lot more samples than others. What do you suggest doing about that? – Wajih101 Jul 10 '23 at 08:10
  • Oversampling, maybe? Possibly [using SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/). That, or class weighting (to help the model pay more 'attention' to the underrepresented classes during training). – VonC Jul 10 '23 at 08:29
  • How many examples do you have for fine-tuning? – SilentCloud Jul 11 '23 at 08:36
  • @SilentCloud I do have 40k observations for 99 categories, but some categories have only 2 observations while others have hundreds; it's very imbalanced. I'm going to try to use SMOTE or filter out certain categories and test the model. – Wajih101 Jul 13 '23 at 09:08
  • @VonC can SMOTE be used for string data or is it better to look for some text augmentation methods? – Wajih101 Jul 13 '23 at 11:10
  • @Wajih101 Not really. I have [posted an answer](https://stackoverflow.com/a/76678873/6309) to address your comment. – VonC Jul 13 '23 at 11:29

3 Answers


I'm currently working on a sequence classification task, and something I noticed during my training will probably help in your case.

Truncation: if a sentence is longer than 128 tokens (MAX_LEN) and you truncate it, then the model is essentially predicting on a partial data point (a partial string, since everything beyond 128 tokens is cut off).

  • For my use case, I was using a RoBERTa model, which has a MAXLENGTH of 512 tokens; I cannot go beyond that in a given data point. So I split each string into multiple sub-sequences ("windows") of 512 tokens, padded the last sub-sequence when it was shorter than 512 tokens (since a data point is not always an exact multiple of 512 tokens), and then aggregated the predictions over the sub-sequences (a minimal sketch follows).
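
A minimal sketch of that windowing idea (illustrative names only, not my exact code; it assumes a tokenizer and a sequence classification model are already loaded):

import torch

def predict_long_text(text, tokenizer, model, max_len=512):
    # Tokenize without special tokens, then cut the IDs into windows
    # that leave room for the <s> and </s> tokens in each window
    ids = tokenizer.encode(text, add_special_tokens=False)
    step = max_len - 2
    windows = [ids[i:i + step] for i in range(0, len(ids), step)]

    logits_per_window = []
    with torch.no_grad():
        for w in windows:
            # Re-add the special tokens around each window (batch of 1)
            input_ids = torch.tensor([[tokenizer.cls_token_id] + w + [tokenizer.sep_token_id]])
            logits_per_window.append(model(input_ids).logits)

    # Aggregate by averaging the logits over all windows
    return torch.cat(logits_per_window).mean(dim=0)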

While that was a trick that seemed realistic to me, what you can actually do is the following:

  • I'm not aware of the maximum length allowed by the BERT model you are using, but you could try increasing the max length to the maximum allowed (it may not be 128 itself) so that most of your data points fit without any truncation.
  • How to do this: create a distribution plot of the token counts of each data point and see whether the median / mean / nth percentile / max of that distribution works as the max_length parameter, then train the model on this (see the sketch after this list).
  • I'm not sure about your data, but @VonC suggested using SMOTE; in a similar spirit, you could also use generative AI (OpenAI, etc.) to increase your data size. – sastaengineer Jul 11 '23 at 08:16
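
For the distribution check in the second point, a quick sketch (here texts stands for your list of strings and tokenizer for the one already loaded; both names are placeholders):

import numpy as np

# Token count per data point, without truncation
lengths = [len(tokenizer.encode(t, add_special_tokens=True)) for t in texts]

# Inspect the distribution to pick a sensible MAX_LEN
print("mean:", np.mean(lengths))
print("median:", np.median(lengths))
print("95th percentile:", np.percentile(lengths, 95))
print("max:", np.max(lengths))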

You should use camembert or any other language model just to extract text features. After that, you can use a classifier that takes those feature vectors as inputs.

Training a language model can require a lot of data and compute; if you don't have those, using a pretrained network as a feature extractor is the better option.

from transformers import AutoTokenizer, CamembertModel
from sklearn.neighbors import KNeighborsClassifier
import torch

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
model = CamembertModel.from_pretrained("camembert-base")

# Store features of all inputs
input_features = []
input_labels = []
with torch.no_grad():
    for input_text, label in data.items(): # Or however your data is stored
        # Truncate to the model's maximum input length (512 tokens)
        inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
        outputs = model(**inputs)

        last_hidden_states = outputs.last_hidden_state
        # Take the first token's embedding ([CLS]-style <s> token), assuming a
        # batch of 1; convert to numpy so scikit-learn can consume it
        input_features.append(last_hidden_states[0, 0].numpy())
        input_labels.append(label)

# Use any classifier which might work well for large amount of classes
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(input_features, input_labels)

Edit:

Explanation: CamembertForSequenceClassification (or any deep learning model that accomplishes a task like classification) can be seen as having 2 parts:

  1. A base model that does feature extraction, i.e. maps inputs (texts) into a latent space (a high-dimensional space). This mapping is just a representation of the inputs in a different format, one that describes the "qualities" of each input sample.

  2. A task head that performs the required task, like classification, using that new format of the data. It basically makes the decision that if a data point lies at coordinate X in the latent space, it is likely in class y for a classification task (and something else for some other task).

In the case of CamembertForSequenceClassification, the feature extractor is CamembertModel and the classifier head is CamembertClassificationHead (which is a linear-dropout-linear stack); refer here.

As you can see, the classification head is just 2 layers, which can be trained easily while you make use of the pretrained nature of the base model. Since the base model is also available separately, you can use a classification method different from the 2 linear layers, like KNN, which might work better for a large number of classes with few samples each.
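
To see the two parts concretely, you can print them off the model object (the attribute names match the model printout in the question):

from transformers import CamembertForSequenceClassification

model = CamembertForSequenceClassification.from_pretrained("camembert-base", num_labels=99)
print(model.roberta)     # part 1: the pretrained feature extractor
print(model.classifier)  # part 2: the head (dense -> dropout -> out_proj)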

SajanGohil
  • Thank you for your answer! However, I don't understand the code much. What do you mean by "You should use camembert just to extract text features"? – Wajih101 Jul 13 '23 at 09:10
  • @Wajih101 I have added an explanation about feature extractors – SajanGohil Jul 13 '23 at 09:28

The OP mentions in the comments that some classes have a lot more samples than others.

I suggested using SMOTE (Synthetic Minority Oversampling Technique).

That, or class weighting (to help the model pay more 'attention' to the underrepresented classes during training).

However, the OP adds:

I do have 40k observations for 99 categories, but some categories have only 2 observations while others have hundreds, it's very imbalanced. I'm going to try and use SMOTE or filter out certain categories and test the model.

Can SMOTE be used for string data or is it better to look for some text augmentation methods?

SMOTE is an algorithm originally designed for continuous data, and using it with categorical or text data can be a bit tricky. There are adaptations of SMOTE for categorical data (like SMOTE-NC), but even these might not be perfect for text data.

For text data, there are several ways you can perform augmentation, for example:

  • synonym or word substitution (swapping words for similar ones);
  • back-translation (translating the text to another language and back);
  • random insertion, swap, or deletion of characters or words.

These techniques can help create more examples of the under-represented classes in your text classification task. One thing to note, though, is that while these techniques create more examples, those examples are not truly 'new' data, so the model might still struggle if the classes with few examples are fundamentally hard to classify.

Text augmentation tools like the Python library nlpaug can help you perform these types of augmentation. It provides functionality for various augmentation methods, including substituting words based on word embeddings, substituting characters, inserting new characters/words, swapping characters/words, and deleting characters/words.
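
For instance, a minimal nlpaug sketch (untested; the sample sentence is made up, and aug.augment returns either a string or a list depending on the nlpaug version):

import nlpaug.augmenter.word as naw

# Substitute some words using a French masked language model, so the
# replacements stay plausible in context
aug = naw.ContextualWordEmbsAug(model_path='camembert-base', action='substitute')

text = "Formation en gestion de projet"   # made-up example sentence
augmented = aug.augment(text)
print(augmented)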

Another option is to combine text augmentation and class weighting (as I mentioned before) to handle the imbalance problem. That could work better if the classes with very few examples in your dataset are hard to predict even with augmented data.
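
For the class-weighting option, here is a sketch of how it could be wired into a training loop like the one in the question (train_labels stands for the numeric training labels; it assumes every class occurs at least once in the training set):

import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Inverse-frequency weights: rare classes get larger weights
weights = compute_class_weight('balanced', classes=np.unique(train_labels), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float).to(device)
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

# Inside the training loop: get the logits without passing labels,
# then compute the weighted loss yourself
logits = model(b_input_ids, attention_mask=b_input_mask).logits
loss = loss_fn(logits, b_labels)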

Remember to verify the quality of your augmented data, and ensure that the augmented data maintains the original meaning and context. The quality of your augmented data can significantly affect your model's performance.

Lastly, you could also look at more advanced over-sampling techniques for text data, such as the Contextualized Over-Sampling (COS) method, which leverages transformers (like BERT) to generate semantically similar sentences. See for instance "BERT for Sequence Labelling with Imbalanced Data" by Lorenzo Pozzi.

VonC