Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
0 votes, 0 answers

performance metrics stratified cross validation

I have implemented stratified cross validation for a multiclass imbalanced dataset. I'm unable to calculate the average of each performance metric, such as precision, recall, etc. skf = StratifiedKFold(n_splits=10) lst_accu_stratified =…
0 votes, 0 answers

How to find documents similar to a predefined set of documents

From a big population of documents, I would like to find those similar to a predefined set of documents. All documents inside the set are similar to each other, but very few documents from the population are similar to those in the set. Quite unbalanced…
0 votes, 0 answers

Optimal metric for training with Class-specific masked input features and imbalanced dataset

I have a classification problem with 8 classes, which are extremely imbalanced. The input dataset consists of sequences, each of length n features, where n = 19. For each of the 8 classes, I have prior knowledge of which subset of the n features are…
0 votes, 0 answers

Error: `data` and `reference` should be factors with the same levels for imbalanced class

I used the SMOTE and Tomek methods to handle my imbalanced classes. I'm trying to fit a boosted regression tree. It runs smoothly until I create the confusion matrix, when I get this error ( Error: data and reference should be factors with the same…
Hanan
0 votes, 0 answers

CEM: Different Imbalance results in R and Stata

I am trying to replicate a cem matching from Stata in R. As a first step I want to evaluate the imbalance. In R I used the following code: vars <- c("X1PLTOT", "X1EBRSTOT", "X1MTHETK2", "X1RTHETK2", "X1DCCSTOT", "X1NRWABL", "female", "latino",…
0 votes, 0 answers

Can MRMR be used for imbalanced dataset?

I tried using MRMR on a dataset in which about 10% of the rows have class '1' and the remaining 90% have class '0'. I used the MRMR code shown below with K=10. However, I realized that after using count ifs there were more rows that each selected…
ABDULMUJEEB
0 votes, 0 answers

Use of data augmentation to achieve a balanced dataset

The theoretical case is that we have a binary image classification task with 70% of the data labeled A and the other 30% labeled B. Data augmentation is generally used to avoid overfitting and get better generalization, but can I also…
0 votes, 0 answers

Is there a cost-sensitive loss function implementation in PyTorch?

I would like to implement a cost-sensitive loss function in PyTorch. My two-class training dataset is heavily imbalanced, where 75% of the data are label '0' and only 25% of the data are label '1'. I am new to PyTorch but my supervisor is adamant…
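
One common way to make a loss cost-sensitive is to weight each class's contribution, so that errors on the rare class cost more. PyTorch's `torch.nn.CrossEntropyLoss` accepts a `weight` tensor for this purpose. The sketch below reproduces the arithmetic in plain Python so the effect of the weights is visible without any framework. The function name `weighted_bce`, the predicted probabilities, and the inverse-frequency weights derived from the 75%/25% split are assumptions for illustration, not part of the question.

```python
import math

def weighted_bce(probs, labels, w_neg, w_pos):
    """Class-weighted binary cross-entropy, normalised by the total weight.

    probs  -- predicted probability of label 1 for each sample
    labels -- true labels (0 or 1)
    """
    total_loss = 0.0
    total_weight = 0.0
    for p, y in zip(probs, labels):
        w = w_pos if y == 1 else w_neg
        # Negative log-likelihood of the true label.
        nll = -math.log(p if y == 1 else 1.0 - p)
        total_loss += w * nll
        total_weight += w
    return total_loss / total_weight

# Hypothetical inverse-frequency weights for a 75% / 25% class split.
w_neg = 1.0 / 0.75
w_pos = 1.0 / 0.25

labels = [0, 0, 0, 1]
probs = [0.1, 0.1, 0.1, 0.1]  # a model that always predicts "probably 0"

unweighted = weighted_bce(probs, labels, 1.0, 1.0)
weighted = weighted_bce(probs, labels, w_neg, w_pos)
```

With equal weights, the single missed minority sample is diluted by three easy majority samples; with the assumed weights, the same mistake roughly doubles the loss, which is the pressure a cost-sensitive loss is meant to apply.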
0 votes, 1 answer

Stratified sampling for semantic segmentation

I have a set of images and multi-label masks (an image usually has segments of more than one class) and I would like to split it into train and validation sets. The data is imbalanced, where two of the classes appear in about 1% of the images and…
0 votes, 1 answer

Stratify sklearn train_test_split using a dummy vector for the 'stratify' parameter

I want to split my data into train, val, and test sets, using the stratify parameter of the train_test_split function. I want to use a binary dummy vector (the vector name is prop) for the stratify parameter, making the test's labels proportion the…
0 votes, 0 answers

ROSE() in R giving me negative samples when all values in training set are positive integers

I am oversampling my training dataset using ROSE() in R as below, and the oversampled dataset contains several negative values for columns that are meant to be strictly positive. The original training data is also positive, so I am surprised that…
DV24
0 votes, 0 answers

MiniBatches there are no samples for class label exception

I was following the first example given in the Accord.Net framework's documentation here to train a multi-class SVM classifier with my own dataset, but during the training loop I got an error that says: There are no samples for class label 3. Please…
0 votes, 0 answers

Imbalanced data: precision and recall when the minority is negative case instead of positive case

I have an imbalanced dataset where 90% of cases have Y = 1 and 10% of cases have Y = 0. In this case, do precision and recall still apply? Because precision and recall focus on true positives (TP), which is not the case in my dataset. In my…
ycenycute
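
Precision and recall are defined relative to whichever class is designated as positive, so they can be computed for the minority class Y = 0 by treating 0 as the positive label. A minimal sketch in plain Python, assuming made-up labels and predictions that follow the 90%/10% split described above; the helper `precision_recall` is hypothetical. (In scikit-learn, `precision_score` and `recall_score` expose a `pos_label` parameter for the same purpose.)

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up data: 18 cases with Y = 1 and 2 cases with Y = 0 (90% / 10%).
y_true = [1] * 18 + [0] * 2
# Hypothetical classifier: one Y = 1 case predicted as 0, one Y = 0 case predicted as 1.
y_pred = [1] * 17 + [0] + [0, 1]

p_major, r_major = precision_recall(y_true, y_pred, positive=1)
p_minor, r_minor = precision_recall(y_true, y_pred, positive=0)
```

On this toy data both metrics are about 0.94 when 1 is the positive label and 0.5 when 0 is the positive label, so the choice of positive class decides whether the metrics describe the minority class at all.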
0 votes, 1 answer

High AUC and 100% recall, but precision and F1 are low

I have an imbalanced dataset which has 43323 rows and 9 of them belong to 'failure' class, other rows belong to 'normal' class. I trained a classifier with 100% recall and 94.89% AUC for test data (0.75/0.25 split with stratify = y). However, the…
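
With so few positives, precision can collapse even when recall is perfect, because a small false-positive rate on the majority class still produces far more false alarms than there are true failures. A rough worked calculation follows. It assumes the 25% test split holds about 10,831 rows with 2 failures (9 × 0.25, rounded down), and the false-positive rates are hypothetical values chosen for illustration, not figures from the question.

```python
def precision_at(recall, fpr, n_pos, n_neg):
    """Precision implied by a given recall and false-positive rate."""
    tp = recall * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp)

# Assumed test-set composition: 25% of 43323 rows, about 2 of them failures.
n_test = round(43323 * 0.25)
n_pos = 2
n_neg = n_test - n_pos

# Hypothetical false-positive rates, for illustration only.
for fpr in (0.001, 0.01, 0.05):
    precision = precision_at(1.0, fpr, n_pos, n_neg)
    print(f"FPR={fpr:.3f} -> precision={precision:.3f}")
```

Under these assumptions, even a 1% false-positive rate leaves precision below 2%, while recall stays at 100% and AUC can remain high.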
0 votes, 1 answer

How to process "strong" imbalanced data for multi-label image classification with transfer learning

I tried this myself but couldn't reach a final solution, which is why I'm posting here; please guide me. I'm working on multi-label image classification and have a slightly different scenario. I have a big and significantly imbalanced dataset. You can see the…