Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.
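A minimal sketch of why the definition above matters in practice, using made-up toy labels (the 98:2 ratio and the always-negative "model" are illustrative assumptions, not taken from the survey): a predictor that ignores the poorly represented cases can still score high accuracy while recovering none of them.

```python
# Hypothetical toy data: 98 negatives and 2 positives (the cases the user cares about).
y_true = [0] * 98 + [1] * 2
# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

# Accuracy counts every correct prediction equally.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall looks only at the relevant (positive) cases.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy is high here even though none of the relevant cases are found, which is why metrics such as recall, precision, or F1 are usually reported for these problems.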

Software

Related Tags and Techniques

351 questions
2
votes
1 answer

Undersampling by groups in R to address class and features imbalance issues in hierarchical data

Let's say I have hierarchical data with heavily imbalanced observations between my target variable and a categorical predictor of interest: # Load appropriate package library(tidyverse) # Set seed for reproducibility set.seed(999) # Create a…
2
votes
0 answers

How to handle imbalanced multi-label dataset?

I am currently trying to train an image classification model using Pytorch densenet121 with 4 labels (A, B, C, D). I have 224000 images and each image is labeled in the form of [1, 0, 0, 1] (Label A and D are present in the image). I have replaced…
2
votes
1 answer

Which should I use, oversampling or undersampling?

The data I have has an imbalance. It is about 45000 : 1500 imbalance, but when oversampling, smote, and smotetomek are used, more than 97% of the results are obtained. However, when the test was actually performed, all 1500 cases had opposite…
2
votes
3 answers

focal loss NLP/text data pytorch - improving results

I have an NLP/text data classification problem where there is a very skewed distribution: class 0 - 98%, class 1 - 2%. For my training and validation data I am doing oversampling, and my class distribution is class 0 - 55%, class 1 - 45%. The test…
user2543622
2
votes
1 answer

Using Focal Loss for imbalanced dataset in PyTorch

I found this implementation of focal loss in GitHub and I am using it for an imbalanced dataset binary classification problem. # IMPLEMENTATION CREDIT: https://github.com/clcarwin/focal_loss_pytorch class FocalLoss(nn.Module): def…
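For readers unfamiliar with the loss this question refers to, here is a plain-Python sketch of the binary focal loss formula from Lin et al. (2017), FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). It is not the GitHub implementation linked in the question; the function name, the default `gamma` and `alpha` values, and the `eps` clamp are assumptions chosen for illustration.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one example.

    p: predicted probability of class 1; y: true label (0 or 1).
    gamma, alpha, eps are illustrative defaults, not values from the question.
    """
    p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct example contributes far less than a badly wrong one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
print(easy, hard)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a quick sanity check for any implementation.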
Mona Jalal
2
votes
1 answer

How can I apply different weights for my loss function based on the ones coming from my train_dataloader method in PyTorch Lightning?

So basically, I am using the class from the Pytorch Lightning Module. My issue is that I'm loading my data using Pytorch Dataloader: def train_dataloader(self): train_dir = f"{self.img_dir_gender}/train" # train_transforms: from PIL to…
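One common way to obtain per-class loss weights is inverse class frequency, w_c = N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. The sketch below is plain Python with a hypothetical helper name and toy labels; how the resulting weights are handed to a loss function depends on the framework and is not shown.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = N / (K * count_of_class).

    Hypothetical helper for illustration; rarer classes get larger weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Toy labels: 90 samples of class 0 and 10 of class 1 (made-up numbers).
weights = inverse_frequency_weights([0] * 90 + [1] * 10)
print(weights)
```

The weights should be computed from the training labels only, so that validation and test data do not influence the loss.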
2
votes
0 answers

Handling Imbalanced Data with Large Dataset

I have a dataset of 6m+ rows and about 300 columns that I am currently pre-processing with dask in Python. I'm building a classifier and there is a severe class imbalance that I would normally handle using sampling methods through imblearn (random…
jxo
2
votes
0 answers

R Tuning Binary Prediction Threshold

I am running a multilevel binary logistic regression (MLBLR) model using glmer. After having trained the MLBLR on the training data (which was created using Tidymodels), I now intend to tune/calibrate the prediction probability threshold. I hereby…
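The threshold tuning described here can be sketched independently of the model. The following plain-Python example scans candidate thresholds and keeps the one with the best F1 score; the function names, the candidate grid, the choice of F1 as the criterion, and the toy probabilities are all assumptions for illustration (the question itself uses R and glmer).

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score when predicting 1 for every probability >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))

# Made-up predicted probabilities from a model trained on imbalanced data:
# the positives score higher than the negatives, but all scores are below 0.5.
probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40]
labels = [0, 0, 0, 0, 1, 1]
chosen = best_threshold(probs, labels)
print(chosen, f1_at_threshold(probs, labels, chosen))
```

The threshold should be selected on held-out validation data rather than on the test set, otherwise the reported test performance is optimistic.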
LB.
2
votes
1 answer

How to solve the wrong variable type error when handling an imbalanced dataset with ROSE in R?

I am learning R with the Fraud Transaction data. When I try to use ROSE to handle the imbalanced dataset, the "only handle continuous and categorical variables" error pops up. Here's what I tried: str(dataset) 'data.frame': 6362620 obs. of 13…
WILLIAM
2
votes
0 answers

R: how to get the same (high-quality) results from ranger using aligned setting for h2o(.ai) randomForest

tl;dr What setting in either R::ranger or h2o.ai::randomForest can account for the very different performances on the exact same data? Background: I'm trying to classify using a somewhat strongly imbalanced dataset, and the measure-of-goodness…
EngrStudent
2
votes
0 answers

Balanced batch generator returns inconsistent class number

I am using imblearn.keras.balanced_batch_generator in my CNN classification task. But the generator produces inconsistent classes for my data (I have 12 classes in total but it produces 10/11/12 classes when batches are yielded). This is causing an…
2
votes
1 answer

BERT classification on imbalanced or small dataset

I have a large corpus, no labels. I trained this corpus to get my BERT tokenizer. Then I want to build a BertModel to do a binary classification on a labeled dataset. However, this dataset is highly imbalanced, 1: 99. So my question is: Does…
2
votes
1 answer

Loss function for binary classification with problem of data imbalance

I am trying to segment multiple sclerosis lesions in MR images using deep convolutional neural networks with Keras. In this task, each voxel must be classified as either a lesion voxel or a healthy voxel. The challenge of this task is data imbalance…
2
votes
1 answer

Undersampling before or after Train/Test Split

I have a credit card dataset in which 98% of transactions are non-fraud and 2% are fraud. I have been trying to undersample the majority class before the train/test split, and I get very good recall and precision on the test set. When I do the undersampling…
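A common recommendation for this situation is to split first and resample only the training portion, so the test set keeps the real class distribution. The sketch below shows that ordering in plain Python; the function name, the assumption that label 1 is the minority class, the 20% test fraction, and the toy rows are all hypothetical, and the split is a simple random one rather than a stratified one.

```python
import random

def split_then_undersample(rows, label_of, test_fraction=0.2, seed=0):
    """Split into train/test first, then undersample the majority class in train only.

    Assumes label 1 is the minority class and label 0 the majority class.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # 1) Hold out the test set before any resampling.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # 2) Undersample the majority class in the training portion only.
    minority = [r for r in train if label_of(r) == 1]
    majority = [r for r in train if label_of(r) == 0]
    majority_sample = rng.sample(majority, min(len(majority), len(minority)))

    balanced_train = minority + majority_sample
    rng.shuffle(balanced_train)
    return balanced_train, test

# Toy data: 1000 rows of (id, label), 2% positives (made-up numbers).
rows = [(i, 1 if i < 20 else 0) for i in range(1000)]
train, test = split_then_undersample(rows, label_of=lambda r: r[1])
print(len(train), len(test))
```

Because the test rows are never resampled, precision and recall measured on them reflect the original imbalance rather than the artificially balanced one.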
2
votes
1 answer

Remove rows with more than percentage of missing data for majority class samples only

Similar to this post, I am removing rows with >50% missing data to get a more reliable and complete dataset # Keep only the rows with at least x% non-NA values # calculate threshold numOfFeatures=38 # num of features in…
sums22