Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

Software

Related Tags and Techniques

351 questions
0
votes
0 answers

Oversampling with only nominal features: which over- or undersampling techniques are valid in this case?

I have data where all features are nominal. I applied SMOTE-NC, then found that it only works with a combination of nominal and continuous features. There is a technique called SMOTE-N (to deal with only nominal features) in the same paper of…
0
votes
1 answer

Ignore columns in SMOTE oversampling

I have six feature columns and one target column, which is imbalanced. Can I apply an oversampling method like ADASYN or SMOTE, creating synthetic records only for the four columns X1, X2, X3, X4 and copying the rest unchanged as constants (Month, year…
Ayyasamy
  • 149
  • 1
  • 13
0
votes
0 answers

Evaluating Model Outcome on Test Set After Downsampling Training Data because of Class Imbalance

I'm working with an extremely class imbalanced data set (the % of positive classes is ~0.1%) and have explored a number of different sampling techniques to help improve the model performance (measured by AUPRC). Since I only have a few thousand…
shadowprice
  • 617
  • 2
  • 7
  • 14
0
votes
1 answer

meaning of weighted metrics in scikit: bigger class more weight or smaller class more weight?

I am dealing with an imbalanced dataset and tried to handle it with the validation metric. In the scikit-learn documentation I found the following for weighted: Calculate metrics for each label, and find their average weighted by support (the number of true instances…
nopact
  • 195
  • 2
  • 12
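
The quoted definition ("average weighted by support") can be checked with a small library-free sketch. The helper names per_class_f1, macro_f1 and weighted_f1 are hypothetical and written for illustration only; they are not scikit-learn functions, and the toy labels are made up. Under these assumptions, the class with more true instances contributes more to the weighted average:

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for each label that occurs in y_true (hypothetical helper)."""
    scores = {}
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean: every class counts the same."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean weighted by support, i.e. the number of true instances per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy data: class 0 has 8 true instances, class 1 has only 2.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two minority samples is missed

scores = per_class_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(scores, macro, weighted)
```

Here the majority class scores higher than the minority class, so the support-weighted average sits closer to the majority score than the unweighted (macro) average does.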
0
votes
1 answer

How to undersample/oversample more than two classes' dataset using "imblearn" library in Python?

I am working with the "imblearn" library for undersampling. I have four classes in my dataset with 20, 30, 40 and 50 samples respectively (so the classes are imbalanced). But when I try to undersample the dataset using "fit_resample(X, y)", it only…
Rawnak Yazdani
  • 1,333
  • 2
  • 12
  • 23
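
As a rough illustration of what undersampling every class down to the smallest one means, here is a library-free sketch. The function undersample_to_smallest and the toy data are assumptions made for this example; this is not imblearn's implementation, only the class sizes (20, 30, 40, 50) come from the question:

```python
import random
from collections import Counter, defaultdict


def undersample_to_smallest(X, y, seed=0):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(y):
        indices_by_class[label].append(idx)

    target = min(len(idxs) for idxs in indices_by_class.values())

    keep = []
    for idxs in indices_by_class.values():
        # Sample without replacement so no row is duplicated.
        keep.extend(rng.sample(idxs, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data with the class sizes from the question.
y = ["a"] * 20 + ["b"] * 30 + ["c"] * 40 + ["d"] * 50
X = [[i] for i in range(len(y))]

X_res, y_res = undersample_to_smallest(X, y)
counts = Counter(y_res)
print(counts)
```

Fixing the seed makes the selection reproducible; a different seed keeps different rows but the same per-class counts.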
0
votes
1 answer

How can I reshape (120, 100, 100) shaped image data to (120, 10000) shape to undersample using "imblearn" library of Python?

I am working with the imblearn library of Python for undersampling. Necessary code: undersample = RandomUnderSampler(sampling_strategy='majority') X_under, y_under = undersample.fit_resample(X, y) Here X is my image dataset, of shape (120, 100, 100)…
Rawnak Yazdani
  • 1,333
  • 2
  • 12
  • 23
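
The reshape asked about in the title can be sketched with NumPy alone. The array below is placeholder data with the shape from the question, and the resampling step itself is left out; this only shows flattening each image to one row and restoring the image shape afterwards:

```python
import numpy as np

# Placeholder data standing in for 120 images of 100 x 100 pixels.
X = np.arange(120 * 100 * 100, dtype=np.float32).reshape(120, 100, 100)

# Flatten every image into one row: (120, 100, 100) -> (120, 10000).
# -1 lets NumPy infer the second dimension from the remaining axes.
X_flat = X.reshape(X.shape[0], -1)
print(X_flat.shape)

# A resampler that expects 2-D input would be applied to X_flat here.
# Afterwards, the rows can be turned back into images. The leading -1
# allows for the number of samples having changed during resampling.
X_back = X_flat.reshape(-1, 100, 100)
print(X_back.shape)
```

Because reshape only reinterprets the layout of the same values, the round trip returns the original array when no rows were dropped in between.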
0
votes
1 answer

How to pass an argument to a function within a customized function?

First of all the code snippet: ## Packages from sklearn.svm import SVC from sklearn.model_selection import train_test_split from sklearn.metrics import fbeta_score from imblearn.over_sampling import RandomOverSampler from sklearn.datasets import…
RazorLazor
  • 71
  • 5
0
votes
0 answers

R imbalance package Error in Ops.data.frame(dataset[, classAttr], minorityClass) ‘==’ only defined for equally-sized data frames

So whenever I try to use some imbalance function on my dataset I get this error: Error in Ops.data.frame(dataset[, classAttr], minorityClass) : ‘==’ only defined for equally-sized data frames This is my code: dset <-…
DolceVita34
  • 115
  • 1
  • 1
  • 6
0
votes
0 answers

Creation of synthetic data - Balance a dataset

I'm analyzing the Pokemon dataset. I'd like to create a random forest to predict whether a Pokemon is legendary or not. Right now, I have a training dataset of 118 observations and 44 columns: variables: $ type1_bug : int 0 0…
Panri93
  • 213
  • 1
  • 10
0
votes
1 answer

renormalizing class weights for imbalanced data

I have a set of imbalanced data for training a CNN. I want to calculate class weights that will be proportional to the frequency of each label, such that labels that are less frequent will be enhanced when calculating the…
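
One common way to make rarer labels count more is to weight each class by the inverse of its frequency and then renormalize. The sketch below assumes that reading of the question; the function names and the 90/10 toy labels are hypothetical, and the choice to renormalize so the weights sum to 1 is just one possible convention:

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def normalize_to_sum_one(weights):
    """Rescale the weights so they sum to 1 (one possible renormalization)."""
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Toy labels: class 0 is nine times as frequent as class 1.
labels = [0] * 90 + [1] * 10

raw = inverse_frequency_weights(labels)
normalized = normalize_to_sum_one(raw)
print(raw)
print(normalized)
```

With these weights the rare class receives nine times the weight of the frequent class, which mirrors the 9:1 ratio of their counts.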
0
votes
1 answer

Steps for highly imbalanced classification. Should I up-sample & under-sample the data, or just up-sample the imbalanced class?

I have a highly imbalanced binary (yes/no) classification dataset. The dataset currently has approximately 0.008% 'yes'. I need to balance the dataset using SMOTE. I came across 2 methods to deal with the imbalance. The following steps after I have run…
John Doe
  • 637
  • 2
  • 7
  • 14
0
votes
2 answers

F1 score reduced after using class weight

I am working on a multi-class classification use case and the data is highly imbalanced. By highly imbalanced data I mean that there is a huge difference between the class with maximum frequency and the class with minimum frequency. So if I go ahead…
0
votes
0 answers

TypeError: 'int' object is not subscriptable (imblearn generator)

I am dealing with an imbalanced text-based dataset. I used the tensorflow balanced batch generator to create balanced batches when training a model, as follows: batch_generator, steps_per_epoch = balanced_batch_generator(training_x, training_y, BATCH, …
Elham
  • 827
  • 2
  • 13
  • 25
0
votes
0 answers

How do I double the data for classes which have fewer images compared to other classes?

My training data is imbalanced, so I decided to resample my dataset. I want to make slight changes while resampling. I'd like to apply a horizontal flip and Gaussian filter to minority classes to make all classes equal. To do so, I'd like to use…
0
votes
0 answers

Imbalanced text classification by oversampling: correction class probability

My dataset has 3 classes and 900 examples for training. Class distribution is 220, 185, and 500. I found that if I oversample the training data then I have to correct/calibrate the predicted probability of the test data because after oversampling the…
user3363813
  • 567
  • 1
  • 5
  • 19
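
One standard way to correct predicted probabilities after resampling is a prior-shift adjustment: multiply each predicted class probability by the ratio of the original class prior to the resampled class prior, then renormalize. The sketch below uses the class counts from the question; the function name, the assumption that oversampling produced equal class shares, and the example model output are all hypothetical:

```python
def correct_for_prior_shift(probs, original_priors, resampled_priors):
    """Rescale probabilities from a model trained on resampled data.

    Each probability is multiplied by original_prior / resampled_prior
    and the result is renormalized to sum to 1.
    """
    adjusted = [
        p * orig / res
        for p, orig, res in zip(probs, original_priors, resampled_priors)
    ]
    total = sum(adjusted)
    return [a / total for a in adjusted]


# Class counts from the question.
counts = [220, 185, 500]
n = sum(counts)
original_priors = [c / n for c in counts]

# Assumption: oversampling made all three classes equally frequent.
resampled_priors = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical model output for a sample the model cannot tell apart.
model_output = [1 / 3, 1 / 3, 1 / 3]

corrected = correct_for_prior_shift(model_output, original_priors, resampled_priors)
print(corrected)
```

For an uninformative prediction like the one above, the corrected probabilities fall back to the original class shares, which is a quick sanity check on the adjustment.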