Questions tagged [imbalanced-data]

Problem definition

In machine learning, data is considered imbalanced when:

"The user assigns more importance to the predictive performance... on a subset of the target variable domain."

"[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

Software

imbalanced-learn: imblearn

Related Tags and Techniques

SMOTE: smote (Synthetic Minority Oversampling Technique)
Resampling: resampling
Oversampling: oversampling
Downsampling: downsampling
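
As a rough illustration of the oversampling and downsampling tags above, the sketch below balances a toy dataset using only the Python standard library. The helper name `random_resample` and the toy data are hypothetical and are not part of imblearn or any other library. It only shows the idea of duplicating minority rows or discarding majority rows at random. Resampling of this kind is normally applied to the training split only.

```python
import random
from collections import Counter


def random_resample(rows, labels, mode="over", seed=0):
    """Balance classes by random oversampling ("over") or downsampling ("under").

    Hypothetical helper for illustration; not an imblearn API.
    """
    if mode not in ("over", "under"):
        raise ValueError("mode must be 'over' or 'under'")
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Oversampling grows every class to the largest class size;
    # downsampling shrinks every class to the smallest class size.
    sizes = [len(members) for members in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target:
            # Keep a random subset, drawn without replacement.
            chosen = rng.sample(members, target)
        else:
            # Keep every original row and add random duplicates.
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: 95 rows of class 0 and 5 rows of class 1.
rows = list(range(100))
labels = [1 if i < 5 else 0 for i in rows]

over_rows, over_labels = random_resample(rows, labels, mode="over")
under_rows, under_labels = random_resample(rows, labels, mode="under")

print(Counter(over_labels))
print(Counter(under_labels))
```

Random duplication adds no new information, which is one reason techniques such as SMOTE synthesize new minority samples instead.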

351 questions

1 answer

Undersampling by groups in R to address class and features imbalance issues in hierarchical data

Let's say I have hierarchical data with heavily imbalanced observations between my target variable and a categorical predictor of interest: # Load appropriate package library(tidyverse) # Set seed for reproducibility set.seed(999) # Create a…

r machine-learning dplyr imbalanced-data

asked Oct 28 '22 at 12:44

Philippe Duteil

0 answers

How to handle imbalanced multi-label dataset?

I am currently trying to train an image classification model using PyTorch densenet121 with 4 labels (A, B, C, D). I have 224000 images and each image is labeled in the form of [1, 0, 0, 1] (labels A and D are present in the image). I have replaced…

machine-learning pytorch multilabel-classification imbalanced-data

asked Jun 18 '22 at 05:44

Wei Feng

1 answer

Which should I use, oversampling or undersampling?

My data is imbalanced, at roughly 45000 : 1500. When I use oversampling, SMOTE, and SMOTETomek, I get results above 97%. However, when the test was actually performed, all 1500 cases had opposite…

python machine-learning scikit-learn sampling imbalanced-data

asked Jun 14 '22 at 00:55

Tae In Kim

3 answers

focal loss NLP/text data pytorch - improving results

I have an NLP/text classification problem with a very skewed distribution: class 0 is 98% and class 1 is 2%. For my training and validation data I am doing oversampling, and my class distribution is class 0: 55%, class 1: 45%. The test…

python nlp loss-function imbalanced-data

asked Mar 23 '22 at 16:54

user2543622

1 answer

Using Focal Loss for imbalanced dataset in PyTorch

I found this implementation of focal loss in GitHub and I am using it for an imbalanced dataset binary classification problem. # IMPLEMENTATION CREDIT: https://github.com/clcarwin/focal_loss_pytorch class FocalLoss(nn.Module): def…

python deep-learning pytorch computer-vision imbalanced-data

asked Feb 28 '22 at 20:15

Mona Jalal

1 answer

How can I apply different weights for my loss function based on the ones coming from my train_dataloader method in PyTorch Lightning?

So basically, I am using the class from the PyTorch Lightning Module. My issue is that I'm loading my data using the PyTorch DataLoader: def train_dataloader(self): train_dir = f"{self.img_dir_gender}/train" # train_transforms: from PIL to…

python deep-learning cross-entropy imbalanced-data pytorch-lightning

asked Feb 25 '22 at 20:02

Quentin Bracq

0 answers

Handling Imbalanced Data with Large Dataset

I have a dataset of 6m+ rows and about 300 columns that I am currently pre-processing with dask in Python. I'm building a classifier and there is a severe class imbalance that I would normally handle using sampling methods through imblearn (random…

python pandas dask imbalanced-data

asked Nov 18 '21 at 04:08

jxo

0 answers

R Tuning Binary Prediction Threshold

I am running a multilevel binary logistic regression (MLBLR) model using glmer. After having trained the MLBLR on the training data (which was created using Tidymodels), I now intend to tune/calibrate the prediction probability threshold. I hereby…

r performance prediction threshold imbalanced-data

asked Nov 04 '21 at 11:01

LB.

1 answer

How to solve the wrong variable type error when handling an imbalanced dataset with ROSE in R?

I am learning R with the Fraud Transaction data. When I try to use ROSE to handle the imbalanced dataset, the "only handle continuous and categorical variables" error pops up. Here's what I tried: str(dataset) 'data.frame': 6362620 obs. of 13…

r dataframe imbalanced-data

asked Oct 06 '21 at 02:05

WILLIAM

0 answers

R: how to get the same (high-quality) results from ranger using aligned setting for h2o(.ai) randomForest

tl;dr: What setting in either R::ranger or h2o.ai::randomForest can account for the very different performances on the exact same data? Background: I'm trying to classify using a somewhat strongly imbalanced dataset, and the measure-of-goodness…

r random-forest imbalanced-data r-ranger h2o.ai

asked Sep 22 '21 at 22:13

EngrStudent

0 answers

Balanced batch generator returns inconsistent class number

I am using imblearn.keras.balanced_batch_generator in my CNN classification task. But the generator produces inconsistent classes for my data (I have 12 classes in total but it produces 10/11/12 classes when batches are yielded). This is causing an…

python deep-learning imbalanced-data imblearn

asked Aug 25 '21 at 04:29

Lasven Loke

1 answer

BERT classification on imbalanced or small dataset

I have a large corpus with no labels. I trained a BERT tokenizer on this corpus. Then I want to build a BertModel to do binary classification on a labeled dataset. However, this dataset is highly imbalanced, 1:99. So my question is: Does…

bert-language-model imbalanced-data

asked Jul 25 '21 at 04:53

duoduolikes

1 answer

Loss function for binary classification with problem of data imbalance

I am trying to segment multiple sclerosis lesions in MR images using deep convolutional neural networks with Keras. In this task, each voxel must be classified as either a lesion voxel or a healthy voxel. The challenge of this task is data imbalance…

tensorflow image-segmentation loss-function semantic-segmentation imbalanced-data

asked Jul 20 '21 at 07:08

NahidEbrahimian

1 answer

Undersampling before or after Train/Test Split

I have a credit card dataset in which 98% of transactions are non-fraud and 2% are fraud. I have been trying to undersample the majority class before the train/test split and get very good recall and precision on the test set. When I do the undersampling…

machine-learning classification resampling imbalanced-data

asked Feb 09 '21 at 13:34

Vardaan Khanted

1 answer

Remove rows with more than percentage of missing data for majority class samples only

Similar to this post, I am removing rows with >50% missing data to get a more reliable and complete dataset # Keep only the rows with at least x% non-NA values # calculate threshold numOfFeatures=38 # num of features in…

python pandas dataframe imbalanced-data

asked Jan 18 '21 at 13:08

sums22
