
I'm working on an NLP classification problem and I noticed that there are huge disparities between classes.

I'm working with a dataset of ~44k observations and 99 labels. Out of those 99 labels, only 21 have more than 500 observations, and some have as few as 2 observations. Here is a look at the top 21 labels:

[Image: class distribution of the top 21 labels]

What do you suggest I do? Should I just remove labels that fall below a certain threshold? I looked into data augmentation techniques, but I couldn't find clear documentation on how to apply them to French.

If you need me to provide more details please let me know!

EDIT: I created a category called "autre" ("other" in English) into which I put all of the under-represented categories (those with fewer than 300 occurrences). The class distribution now looks like this: [Image: class distribution after grouping rare categories into "autre"]
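
(For reference, the grouping was essentially a value_counts filter, roughly like the sketch below; not my exact code.)

# Sketch: relabel every category with fewer than 300 occurrences as 'autre'
label_col = 'Domaine sou domaine '
counts = training[label_col].value_counts()
rare_labels = counts[counts < 300].index
training[label_col] = training[label_col].where(~training[label_col].isin(rare_labels), 'autre')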

Then I wrote this code to oversample the under-represented categories:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


# Calculate the occurrences of each category and store it in the 'Occurrences' column
training['Occurrences'] = training['Domaine sou domaine '].map(training['Domaine sou domaine '].value_counts())

# Set the target ratio: each category will be resampled to desired_ratio * (count of the largest category)
desired_ratio = 0.5

# Resample every class to the same target size (50% of the largest class):
# classes above the target are undersampled, classes below it are oversampled
balanced_data = pd.DataFrame()

max_occurrences = training['Occurrences'].max()
desired_occurrences = int(max_occurrences * desired_ratio)

for label in training['Domaine sou domaine '].unique():
    # replace=True samples with replacement, so small classes can reach the target size
    samples = training[training['Domaine sou domaine '] == label].sample(n=desired_occurrences, replace=True)
    balanced_data = pd.concat([balanced_data, samples])

# Selecting the specified columns as features
cols = ["Programme de formation", "Description du champ supplémentaire : Objectifs de la formation", "Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_y"]
X = balanced_data[cols]

# Extracting the labels
y = balanced_data['Domaine sou domaine ']

# Splitting the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The problem is that the model now looks TOO GOOD. I was getting at best 58% accuracy before, but now it's 85% at the minimum val_loss.

QUESTION: Is the model overfitting on the small classes? Example: take the least represented category, with 312 observations. We will be repeating those observations randomly almost 10 times according to the formula desired_occurrences = int(max_occurrences * desired_ratio).

If the model is indeed overfitting and I shouldn't take the 85% accuracy seriously, what should I do next?

wageeh

3 Answers


85% because of duplicates in the train and test set
The 85% can be explained by the fact that your train_test_split is executed after you rebalance your dataset. While rebalancing, certain examples are duplicated (note that you use sample(replace=True)) and can thus end up in both your training and test set after the split. Avoid this by splitting first and then rebalancing only your training set.
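
Roughly, the corrected order could look like the sketch below (reusing the DataFrame and column names from your question; adapt the details to your preprocessing):

import pandas as pd
from sklearn.model_selection import train_test_split

label_col = 'Domaine sou domaine '
feature_cols = ["Programme de formation",
                "Description du champ supplémentaire : Objectifs de la formation",
                "Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_y"]

# 1) Split first, so the test set never contains resampled duplicates
train_df, test_df = train_test_split(training, test_size=0.2, random_state=42,
                                     stratify=training[label_col])

# 2) Rebalance only the training split
max_occurrences = train_df[label_col].value_counts().max()
desired_occurrences = int(max_occurrences * 0.5)

balanced_train = pd.concat([
    train_df[train_df[label_col] == label].sample(n=desired_occurrences, replace=True)
    for label in train_df[label_col].unique()
])

X_train, y_train = balanced_train[feature_cols], balanced_train[label_col]
X_test, y_test = test_df[feature_cols], test_df[label_col]   # untouched, realistic class distribution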

Remove small categories
It always depends on the use case, but I often obtain better results by removing highly under-represented categories instead of creating an 'others' category. At inference, the confidence scores for predictions of such categories will hopefully be lower; if you then set a minimum threshold, no prediction is made for those cases. Of course, this only works if you can build a fallback process.
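
As a sketch of that thresholding idea, assuming a scikit-learn-style classifier clf with predict_proba and some new samples X_new (both placeholders here), with an arbitrary threshold:

import numpy as np

CONFIDENCE_THRESHOLD = 0.6                      # placeholder, tune on a validation set

proba = clf.predict_proba(X_new)                # shape: (n_samples, n_classes)
best_idx = proba.argmax(axis=1)
best_conf = proba.max(axis=1)

# Keep the model's label only when it is confident; otherwise route to the fallback process
predictions = np.where(best_conf >= CONFIDENCE_THRESHOLD,
                       clf.classes_[best_idx],
                       "needs_manual_review")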

Check if the model is overfitting
If you have a representative test set and no samples of the training set are repeated in it, you can assume that the performance is legitimate.

Data augmentation
A data augmentation step that works quite well for text is back-translation: translate the samples of your under-represented classes to different languages and back,
e.g. FR -> ES -> EN -> FR. Using more exotic languages will result in more diverse new samples.
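
A rough FR -> EN -> FR sketch using the Hugging Face transformers pipelines (the Helsinki-NLP MarianMT checkpoints are one possible choice; any translation model or API works):

from transformers import pipeline

# One possible choice of public FR<->EN checkpoints; swap in any translation model/API
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

def back_translate(texts):
    # FR -> EN -> FR: the round trip paraphrases the text, producing new training samples
    english = [t["translation_text"] for t in fr_to_en(texts)]
    return [t["translation_text"] for t in en_to_fr(english)]

augmented = back_translate(["Formation en gestion de projet pour débutants."])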

Simi
  • Thank you for your answer! Just a question, can you elaborate on the second point? Do you suggest I completely delete the under-represented categories from the dataset or are you saying there is a way to omit them from being predicted? Also, what do you mean by a fallback process? – wageeh Jul 27 '23 at 08:37
  • I do indeed mean to delete the categories that for example only represent 1% of your dataset. The most basic fallback scenario could just be to process these samples manually. If you can already automate 80% of the work in an actual business use case and 20% still needs to happen manually, your boss will already be very happy with the significant savings. – Simi Jul 27 '23 at 09:13
  • If the under-represented categories are removed from the model, they will of course not be predicted at inference, resulting in an incorrect prediction for those samples. If a category is only 1 percent of the data, the impact on your precision will be low. However, it is even better to detect this 1 percent and not make a prediction for them. This can be done by hoping their confidence score falls below a certain threshold, or by having them recognized by an 'others' category. – Simi Jul 27 '23 at 09:17

Your initial thoughts on tackling the imbalance problem are on the right track, I believe. Particularly,

What do you suggest I do? Should I just remove labels that fall below a certain threshold?

This can certainly be an option if it is applicable; it is hard to say without knowing the use case of the resulting model. Even if every label is required to be included, it can serve as an initial experiment that gives you insights into model performance and into what to try next. So there is no harm in trying it.
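
For instance, dropping every label below some threshold is only a few lines of pandas (a sketch reusing the column name from your code; the cutoff of 300 is arbitrary):

counts = training['Domaine sou domaine '].value_counts()
kept_labels = counts[counts >= 300].index                      # arbitrary cutoff, pick what fits your use case
training_filtered = training[training['Domaine sou domaine '].isin(kept_labels)]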

I looked into data augmentation techniques, but I couldn't find clear documentation on how to apply them to French.

Augmentation for textual data is a bit blurry and relatively difficult (compared to CV/audio), but techniques do exist. Although they are mostly documented for English, you can often transfer the idea to another language, provided the method is not tied to a specific language (e.g. techniques with a hardcoded English vocabulary or the like). You can try a few of the techniques outlined here; I'd go with back-translation first if I were you, since it is not language-specific and works with many languages using current models/APIs.

However, there are now more options after the rise of instruction-following models (e.g. ChatGPT). You can try to prompt-engineer your way into generating additional instances for your under-represented categories.
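
As a sketch, assuming the openai Python client (v1+); the model name, prompt, and category are placeholders, and generated samples should be reviewed before being added to the training data:

from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "Génère 5 descriptions courtes de formations professionnelles "
    "appartenant à la catégorie « Gestion de projet », en français."   # placeholder category
)

response = client.chat.completions.create(
    model="gpt-4o-mini",                                               # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)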

A third option could be to use an algorithm-based technique, such as a loss function that incorporates the class imbalance (e.g. focal loss or other class-balancing/re-weighting losses). These loss functions are mostly used by the vision community, but there is no reason not to use them in NLP.
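
As a sketch, the multi-class focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t) is only a few lines if your model is in PyTorch and outputs class logits (an assumption here; the same formula can be written for Keras/TF):

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # Per-sample cross entropy is -log(p_t), where p_t is the predicted probability of the true class
    ce = F.cross_entropy(logits, targets, reduction="none")
    pt = torch.exp(-ce)                      # recover p_t from -log(p_t)
    # Down-weight easy examples (p_t close to 1) so training focuses on hard/rare ones
    return ((1.0 - pt) ** gamma * ce).mean()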

My advice from here is to review the literature on imbalanced learning, as there are loads of studies out there. The literature generally divides the methods into four folds, but at a higher level there are three: data-based techniques, algorithm-based techniques, and hybrid methods. A recent survey would be the first choice to see both the current SOTA methods and how the methods have historically evolved. For example, you can start by reading this paper and backtrack through the literature as needed.

null

I would suggest doing two things:

  1. When splitting your data for training and testing, it would be better to do so before applying undersampling or oversampling techniques. I suggest this because: a) you will measure performance on your test data in a more realistic situation; with new data you're likely to encounter imbalanced classes, and it's better to know how your model performs under those real circumstances; and b) your training data will not leak into the test set (I think that's the primary reason why you're seeing an improvement in your metrics, though I could be mistaken). You can pass stratify=y (the label column) to train_test_split to split the data in a stratified manner.
  2. I see that you're using undersampling, but I suggest also trying oversampling, because more data tends to lead to a better model. You could use imblearn.over_sampling.RandomOverSampler instead of the manual calculations (see the sketch after this list).
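
Putting both points together, a sketch (assuming X and y are built from the raw, still-imbalanced data as in the question, before any resampling):

from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# 1) Stratified split on the raw (still imbalanced) data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# 2) Oversample only the training split; the test set keeps the real class distribution
ros = RandomOverSampler(random_state=42)
X_train_res, y_train_res = ros.fit_resample(X_train, y_train)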

Another way to balance your classes is to use the parameter in your classification model that is responsible for this. For instance, in LogisticRegression and RandomForestClassifier you can set class_weight='balanced'.
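
For example (a sketch with scikit-learn estimators):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Both estimators reweight samples inversely to class frequency
log_reg = LogisticRegression(class_weight='balanced', max_iter=1000)
forest = RandomForestClassifier(class_weight='balanced', random_state=42)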

I hope this helps, and good luck!

MaryRa