Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
1 vote, 0 answers

ImageDataGenerator.flow_from_dataframe still has problems with Overfitting

I have an image dataset of 2432 images, each belonging to one of 3 categories. The labels are stored in a CSV file with the image id and the label (T1). The distribution of the data is: negative 1695, positive 648, neutral 89. I'm trying to…
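
For the class counts quoted in the excerpt above (negative 1695, positive 648, neutral 89), one common starting point is inverse-frequency class weighting. The sketch below assumes the "balanced" heuristic, n_samples / (n_classes * class_count). The helper name `balanced_class_weights` is hypothetical, and the excerpt does not say whether weighting would help with the overfitting the asker describes.

```python
def balanced_class_weights(counts):
    """Return inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# Class counts taken from the question excerpt.
counts = {"negative": 1695, "positive": 648, "neutral": 89}
weights = balanced_class_weights(counts)

for label, w in weights.items():
    print(f"{label}: {w:.3f}")
```

With these weights, each class contributes the same total weight (count times weight), so the rarest class gets the largest per-sample weight.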
1 vote, 2 answers

Under-sampling leads to poor results for no apparent reason

I am using Random Forest for a semantic segmentation task, with 3 classes, which are imbalanced. First, I just trained the algorithms on random subsets containing 20% of all the pixels (else my memory cannot handle training the algorithms), and got…
Droidux
1 vote, 0 answers

Using Class_weights for imbalanced dataset in Mask RCNN

I have added Class_Weights to be used while training Mask RCNN on a custom dataset. It shows the error: ValueError: Unknown entries in class_weight dictionary: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. Only expected following keys:…
Tima
1 vote, 1 answer

Imblearn pipeline with SMOTE step - AttributeError: This 'Pipeline' has no attribute 'transform'

As part of an assignment, I have been trying to whip up a pipeline to preprocess some data that I have. Said data has a total of five classes, one of which is imbalanced compared to the others, and therefore I decided to apply SMOTE for it. The code…
1 vote, 0 answers

Using Sample_weight on test set

I am using XGBoost for an imbalanced dataset (the ratio of positive to negative samples is 1/14). I used sklearn.utils.class_weight.compute_sample_weight to set the sample_weight parameter. To report the results on test data to my team, I did the same…
SaD
1 vote, 1 answer

What is the correct way to apply a feature selection method to an imbalanced dataset?

I am new to data science & machine learning, so I'll write my question in detail. I have an imbalanced dataset (binary classification dataset), and I want to apply these methods using the Weka platform: 10-fold cross-validation. Oversampling to…
1 vote, 1 answer

How to combine X_train and y_train into one balanced dataframe in Python?

I would highly appreciate your advice on this: I have an imbalanced dataset: only 2% of y is 1. I want to balance only the train dataset and afterwards perform feature selection on the balanced train dataset before fitting the model. After performing…
Ella
1 vote, 2 answers

StratifiedKFold and Over-Sampling together

I have a machine learning model and a dataset with 15 features about breast cancer. I want to predict the status of a person (alive or dead). I have 85% alive cases and only 15% dead. So, I want to use over-sampling for dealing with this problem and…
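
A common recommendation when combining stratified folds with over-sampling is to resample only the training part of each fold, so duplicated minority rows never leak into the validation part. The following is a minimal standard-library sketch of that idea on a toy label vector with the 85% / 15% split from the excerpt. `stratified_folds` and `oversample_indices` are hypothetical helpers written for illustration, not scikit-learn or imbalanced-learn functions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal each class round-robin across folds
    return folds


def oversample_indices(indices, labels, seed=0):
    """Randomly duplicate minority indices until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(v) for v in by_class.values())
    out = []
    for idx in by_class.values():
        out.extend(idx)
        out.extend(rng.choices(idx, k=target - len(idx)))
    return out


labels = ["alive"] * 85 + ["dead"] * 15  # toy labels, 85% / 15%
folds = stratified_folds(labels, k=5)

splits = []
for f, val_idx in enumerate(folds):
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    # Over-sample the training part only; validation keeps its original distribution.
    train_bal = oversample_indices(train_idx, labels, seed=f)
    splits.append((train_bal, val_idx))
```

Each entry of `splits` holds balanced training indices and untouched validation indices, so validation metrics still reflect the original class ratio.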
1 vote, 0 answers

Detectron2 - How to use RepeatFactorTrainingSampler in class imbalance problem

I'm facing a class imbalance problem in a 2-class classification problem. Generally, class 1 is about 25% of class 2 (class 1 = 100 observations, class 2 = 400). So I will need class 1 to have 4x as many observations as it currently has. With the…
kwc
1 vote, 1 answer

Which one should I use for calculating F1 score in PyTorch?

I am using a pretrained model to classify two classes, and I want to compute the F1 score and the weighted F1 score. First I calculate the precision and sensitivity from the confusion matrix (CM) and calculate the F1 score, and I get good results, but I try…
Manar Saad
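
The quantities named in the excerpt above can be derived from a confusion matrix roughly as follows. This is a sketch under assumptions: rows of `cm` are true classes and columns are predicted classes, the matrix values are made up for illustration, and `f1_from_confusion` is a hypothetical helper, not a PyTorch function. Sensitivity is the same quantity as recall, and the weighted F1 here averages the per-class F1 scores by class support.

```python
def f1_from_confusion(cm):
    """Return (per-class F1 list, macro F1, support-weighted F1).

    Assumes cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    f1s, supports = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
        supports.append(tp + fn)
    macro = sum(f1s) / n
    weighted = sum(f * s for f, s in zip(f1s, supports)) / total
    return f1s, macro, weighted


# Illustrative two-class confusion matrix (made-up numbers).
cm = [[90, 10],
      [5, 15]]
per_class, macro_f1, weighted_f1 = f1_from_confusion(cm)
print(per_class, macro_f1, weighted_f1)
```

On imbalanced data the weighted F1 is dominated by the majority class, so it can look good even when the minority-class F1 is poor; reporting the per-class values alongside it avoids that ambiguity.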
1 vote, 0 answers

WeightedRandomSampler with multi-dimensional batch

I'm working on a classification problem (100 classes) and my dataset has a huge class imbalance. To tackle this, I'm considering using torch's WeightedRandomSampler to oversample the minority class. I took help from this post which seemed pretty…
1 vote, 1 answer

learning curve of multiclass classification task

I'm trying to do a multiclass classification using multiple machine learning models with this function that I have created: def model_roc(X, y): X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11) …
1 vote, 1 answer

Undersampling/Oversampling issues with onehotencoded categorical data

I am trying to fit a classification problem which has a (40000 vs 400) split between the 0 and 1 classes. I am trying to play around with oversampling and undersampling (not preferred) but keep running into issues. Error: Shape of passed values is (34372,…
1 vote, 1 answer

Cross multiplication to equalize sample proportions

I have a larger dataset and below is a subset of that data. The category is the dependent variable and Day_1 and Day_2 are independent variables. ID <- c("e-1", "e-2", "e-3", "e-8", "e-9", "e-10", "e-13", "e-16", "e-17", "e-20") Day_1 <- c(0.58,…
Niro Mal
1 vote, 1 answer

StratifiedKfolds with imbalanced data for multiclass classification

I'm trying to build a multiclass classification model using imbalanced data with few samples (436) and 3 classes. After standardizing the data, I split it using stratifiedkfolds to be sure that my minority class is well represented in the train and…