Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
0 votes, 0 answers

How to get smooth accuracy and loss curves in deep learning models

I have an imbalanced two-class classification problem with 12750 samples for class 0 and 2550 samples for class 1. I computed class weights using class_weight.compute_class_weight and passed them to model.fit. I've tested many loss and optimizer…
afrah • 31 • 8
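
For reference, the class counts in the excerpt above give the following "balanced" weights. This is a minimal sketch using only the standard library; the helper name `balanced_class_weights` is hypothetical, and the formula n_samples / (n_classes * class_count) is the "balanced" heuristic that scikit-learn documents for `compute_class_weight`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Class counts taken from the question excerpt: 12750 vs. 2550 samples.
y = [0] * 12750 + [1] * 2550
weights = balanced_class_weights(y)
print(weights)
```

A dictionary of this shape (class index mapped to weight) is the form usually passed as a `class_weight` argument when fitting a model, so the minority class counts five times as much per sample here.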
0 votes, 0 answers

Distinct SVM models giving exactly the same results in R

I'm comparing the predictive power of two Support Vector Machine models in R. I have 6 response variables (categorical) and 24 predictor variables. In one of the models I'm using my data with an imbalance between the response variables and in…
0 votes, 1 answer

Imbalanced multiclass classification dataset: undersample or oversample?

The dataset has around 150k records with four labels, ['A','B','C','D'], distributed as follows: A: 60000, B: 50000, C: 36000, D: 4000. I notice that when using the classification report to get the precision, recall, and f1-score, the f1-score…
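
To make the two options in the title concrete, here is a rough sketch of plain random resampling on the label counts quoted above. The function `random_resample` and the target of 36000 rows per class are illustrative assumptions, not something the question prescribes: classes above the target are undersampled without replacement, and classes below it are oversampled with replacement.

```python
import random
from collections import Counter


def random_resample(labels, target_per_class, seed=0):
    """Return row indices such that every class has target_per_class rows."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    chosen = []
    for idxs in by_class.values():
        if len(idxs) >= target_per_class:
            # Undersample: keep a random subset, no duplicates.
            chosen.extend(rng.sample(idxs, target_per_class))
        else:
            # Oversample: keep every original row, then add random duplicates.
            extra = rng.choices(idxs, k=target_per_class - len(idxs))
            chosen.extend(idxs + extra)
    rng.shuffle(chosen)
    return chosen


# Label distribution from the question excerpt.
labels = ["A"] * 60000 + ["B"] * 50000 + ["C"] * 36000 + ["D"] * 4000
indices = random_resample(labels, target_per_class=36000)
resampled_counts = Counter(labels[i] for i in indices)
print(resampled_counts)
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.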
0 votes, 0 answers

Are oversampling and undersampling good approaches for building good models?

I just worked on the "Heart Failure Prediction" dataset from Kaggle (https://www.kaggle.com/andrewmvd/heart-failure-clinical-data), and I noticed that there were more "Not dead" cases than "dead" cases, so I used SMOTETomek, which resampled my data…
0 votes, 0 answers

LightGBM fails to predict on validation set (R)

I'm having a lot of trouble implementing LightGBM on an extremely imbalanced dataset (using R). I'm dealing with a binary classification problem, and the distribution of the target variable is about 1:800 (approximately class 0: 110,000; class 1: 140). I…
CCbs • 105 • 3
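
One adjustment that is commonly tried for a class ratio like the one in the excerpt above is to weight the positive class by the negative-to-positive ratio. The sketch below only computes that ratio and places it in a parameter dictionary; `objective` and `scale_pos_weight` are LightGBM parameter names, the counts are the approximate ones quoted in the question, and nothing here is claimed to resolve the asker's specific failure.

```python
# Approximate class counts quoted in the question excerpt.
n_negative = 110_000
n_positive = 140

# Ratio of negatives to positives, roughly 786 for these counts.
scale_pos_weight = n_negative / n_positive

# Parameter dictionary in the shape LightGBM training functions accept.
params = {
    "objective": "binary",
    "scale_pos_weight": scale_pos_weight,
}
print(params)
```

The same parameter names apply in the R package's parameter list, although the question's own code is not shown.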
0 votes, 1 answer

Difference between imblearn pipeline and Pipeline

I wanted to use sklearn.pipeline instead of imblearn.pipeline to incorporate `RandomUnderSampler()`. My original data requires missing value imputation and scaling. Here I have breast cancer data as a toy example. However, it gave me the…
0 votes, 0 answers

How to deal with imbalanced datasets in neural network training

I am struggling even to explain my question briefly and clearly, so I'll do my best to provide some background information before jumping into the question. Background: I have a very imbalanced dataset that has 3 classes, which…
xerac • 147 • 8
0 votes, 1 answer

How to correct Python AttributeError: 'SMOTE' object has no attribute 'fit_sample'

Hello, I am trying to run the following code: os = SMOTE(random_state=0) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0) columns = X_train.columns os_data_X,os_data_y=os.fit_sample(X_train, y_train) But I get…
JWeds • 3 • 1 • 2
0 votes, 0 answers

Low G-mean and MCC for binary classification of imbalanced data

I have artificially increased the imbalance ratio to show the impact of different popular scoring metrics on the classification performance. Also, I have artificially added some missing values to see that my pipeline is working properly. However, I…
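
As background for the two metrics named in the title, here is a small sketch that computes the G-mean (geometric mean of sensitivity and specificity) and the Matthews correlation coefficient from confusion-matrix counts. The function name and the example counts are made up for illustration; they are not taken from the question.

```python
import math


def gmean_and_mcc(tp, fp, tn, fn):
    """Return (G-mean, MCC) for a binary confusion matrix."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    g_mean = math.sqrt(sensitivity * specificity)

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return g_mean, mcc


# Hypothetical result on 1000 negatives and 20 positives.
tp, fp, tn, fn = 5, 10, 990, 15
g_mean, mcc = gmean_and_mcc(tp, fp, tn, fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(f"accuracy={accuracy:.3f} g_mean={g_mean:.3f} mcc={mcc:.3f}")
```

With counts like these, accuracy stays high because the majority class dominates, while both G-mean and MCC drop sharply once most positives are missed.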
0 votes, 1 answer

Appropriate f1 scoring for highly imbalanced data

I am confused by the different f1 computations. Which f1 scoring should I use for severely imbalanced data? I am working on a severely imbalanced binary classification problem. The options are 'f1', 'f1_micro', 'f1_macro', and 'f1_weighted'. Also, I want to add…
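
The difference between the averaging modes listed above can be seen on a toy example. This sketch reimplements the averages in plain Python rather than calling a library; `f1_averages` is a hypothetical helper, the key "binary" stands for the plain 'f1' score of the positive class (label 1), and the labels are invented.

```python
def f1_averages(y_true, y_pred, pos_label=1):
    """Return binary, micro, macro, and weighted F1 for the given labels."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class, support = {}, {}
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        support[c] = tp + fn
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    return {
        "binary": per_class.get(pos_label, 0.0),
        # Micro: pool the counts over all classes, then compute one F1.
        "micro": 2 * tp_sum / micro_denom if micro_denom else 0.0,
        # Macro: unweighted mean of the per-class F1 scores.
        "macro": sum(per_class.values()) / len(classes),
        # Weighted: per-class F1 averaged by class frequency in y_true.
        "weighted": sum(per_class[c] * support[c] for c in classes) / len(y_true),
    }


# Invented example: 90 negatives (5 misclassified), 10 positives (6 missed).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 85 + [1] * 5 + [1] * 4 + [0] * 6
scores = f1_averages(y_true, y_pred)
print(scores)
```

In this example the micro and weighted averages stay close to the accuracy because the majority class dominates them, whereas the binary and macro scores expose the weak minority-class performance.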
0 votes, 0 answers

What faster alternatives to SMOTE are there in R for large imbalanced data sets?

I have a training set of 260,000 observations and 30 IVs, with a binary class imbalance of 1:6 (yes, it does hurt the models' performance), but using SMOTE isn't an option, since it takes forever on my laptop with this amount of data. Is there any…
user000 • 123 • 7
0 votes, 1 answer

In R, how do I run a balanced 10-fold CV information gain test for feature selection on imbalanced 2-class data?

I have a large training data set data.trn of 260,000+ observations on 50+ variables, with the dependent variable loan_status consisting of 2 classes, "paid off" and "default", with a respective imbalance of about 5:1. I want to use the information.gain command…
user000 • 123 • 7
0 votes, 0 answers

Feature reduction and class imbalance handling: which should be performed first?

I am working on feature extraction and class imbalance problems and need advice on which to perform first: feature reduction/selection or class imbalance handling?
0 votes, 1 answer

cannot import name 'SMOTEN' from 'imblearn.over_sampling'

SMOTE and SMOTENC are working, but I am unable to use SMOTEN. I tried the solution in this. But still, only for SMOTEN, it returns the error ImportError: cannot import name 'SMOTEN' from 'imblearn.over_sampling'. I am using Jupyter Notebook and below is the…
DOT • 309 • 2 • 11
0 votes, 1 answer

Max_samples hyperparameter in PU bagging for highly imbalanced dataset

I am using the credit card fraud dataset (link below), which is highly imbalanced: the positive class has only 492 instances and the negative class has 284315 instances. I was applying PU Bagging (link below) to it to extract hidden positives in…