I'm working on an NLP classification problem and I've noticed huge disparities between classes.
The dataset has ~44k observations spread over 99 labels. Out of those 99 labels, only 21 have more than 500 observations, and some have as few as 2. Here is a look at the top 21 labels:
What do you suggest I do? Should I just remove labels that fall below a certain threshold? I looked into data augmentation techniques, but I couldn't find clear documentation on how to apply them to French text.
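(To be concrete, by "removing" I mean something like this; just a sketch, using the same training DataFrame and label column as in the code further down, with an arbitrary threshold of 500:)

# Drop every label with fewer than 500 observations (threshold chosen arbitrarily here)
counts = training['Domaine sou domaine '].value_counts()
training = training[training['Domaine sou domaine '].isin(counts[counts >= 500].index)]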
If you need me to provide more details please let me know!
EDIT: I created a category called "autre" ("other" in English) into which I put all of the underrepresented categories (under 300 occurrences). The label distribution now looks like this:
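(For reference, the regrouping step itself was roughly this; a sketch assuming the same label column as in the code below:)

# Fold every category with fewer than 300 occurrences into "autre"
counts = training['Domaine sou domaine '].value_counts()
rare_labels = counts[counts < 300].index
training.loc[training['Domaine sou domaine '].isin(rare_labels), 'Domaine sou domaine '] = 'autre'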
Then I wrote this code to oversample the underrepresented categories (and undersample the largest ones):
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Calculate the occurrences of each category and store it in the 'Occurrences' column
training['Occurrences'] = training['Domaine sou domaine '].map(training['Domaine sou domaine '].value_counts())
# Target ratio: 0.5 means every category is resampled to 50% of the majority class's count
desired_ratio = 0.5

# Resample every category to the same target size: categories above the target are
# undersampled, categories below it are oversampled (sampling with replacement)
max_occurrences = training['Occurrences'].max()
desired_occurrences = int(max_occurrences * desired_ratio)

balanced_parts = []
for label in training['Domaine sou domaine '].unique():
    samples = training[training['Domaine sou domaine '] == label].sample(n=desired_occurrences, replace=True)
    balanced_parts.append(samples)

balanced_data = pd.concat(balanced_parts)
# Selecting the specified columns as features
cols = ["Programme de formation", "Description du champ supplémentaire : Objectifs de la formation", "Intitulé (Ce champ doit respecter la nomenclature suivante : Code action – Libellé)_y"]
X = balanced_data[cols]
# Extracting the labels
y = balanced_data['Domaine sou domaine ']
# Splitting the data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The problem is that the model became TOO good. Before, the best I could get was 58% accuracy; now it's 85% at the point where val_loss is at its minimum.
QUESTION: Is the model overfitting on the small classes? For example, take the least frequent remaining category, with 312 observations: those rows get repeated randomly almost 10 times to reach desired_occurrences = int(max_occurrences * desired_ratio).
If the model is indeed overfitting and I shouldn't take the 85% accuracy seriously, what should I do next?
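Here is a quick sanity check I'm thinking of running (just a sketch; it assumes the X_train/X_test from the split above and compares only the feature columns as strings):

# Rough duplicate check: since resampling happens before train_test_split,
# identical rows can land in both splits and inflate test accuracy.
train_rows = set(map(tuple, X_train.astype(str).values))
overlap = sum(tuple(row) in train_rows for row in X_test.astype(str).values)
print(f"{overlap / len(X_test):.1%} of test rows also appear verbatim in the training set")

If that number is high, I assume the 85% mostly reflects memorized duplicates rather than real generalization.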