Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

Software

Related Tags and Techniques

351 questions
0
votes
0 answers

Fit the model using entire data or from training data?

I am given two data sets: first, the training data with a known class (target); second, the test data with no class (no target). I split the training data into a training set and a validation set. I oversample the training set and test it on my validation set. It…
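The setup described here (split first, then oversample only the training part) can be sketched in plain Python. This is a minimal illustration under stated assumptions: `random_oversample` is a hypothetical helper that does simple random duplication (not SMOTE, and not any library's implementation), and the toy data are made up.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    is as large as the largest one (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy labelled data: 90 negatives, 10 positives (made-up values).
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# Split first: hold out every fifth row for validation.
val_idx = set(range(0, 100, 5))
X_train = [x for i, x in enumerate(X) if i not in val_idx]
y_train = [t for i, t in enumerate(y) if i not in val_idx]
X_val = [x for i, x in enumerate(X) if i in val_idx]
y_val = [t for i, t in enumerate(y) if i in val_idx]

# Oversample the training split only; the validation split keeps
# its original class ratio.
X_res, y_res = random_oversample(X_train, y_train)
train_counts = Counter(y_res)
val_counts = Counter(y_val)
```

Resampling after the split keeps duplicated rows out of the validation set, so the validation score is computed on data with the original class distribution.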
0
votes
0 answers

Why can't we calculate an ROC curve in cost-sensitive learning?

In the Applied Predictive Modeling book, on the cost-sensitive learning approach, the authors write: One consequence of this approach is that class probabilities cannot be generated for the model, at least in the available implementation. Therefore we…
0
votes
1 answer

Poorly calibrated probabilities but good classification in confusion matrix

I have an imbalanced data set. My goal is to balance sensitivity and specificity via the confusion matrix. I used glmnet in R with class weights. The model does well at balancing the sensitivity/specificity, but I looked at the calibration plot, and…
0
votes
2 answers

Imbalanced data set score after SMOTE

Is it correct to use 'accuracy' as a metric for an imbalanced data set after using oversampling methods such as SMOTE, or do we have to use other metrics such as AUROC or other precision-recall related metrics?
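One reason accuracy is often treated with caution here can be shown with a small sketch. It assumes the model is evaluated on held-out data that keeps the original class ratio; the labels are made up, and the two metric functions are simple hand-written versions, not a library API.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Made-up held-out labels: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print(f"accuracy={acc:.2f}, balanced accuracy={bal:.2f}")
```

The always-negative classifier scores high on accuracy while finding none of the positives, which the per-class view makes visible.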
0
votes
1 answer

How to handle unbalanced labels in Multilabel Classification?

These oversimplified example target vectors (in my use case each 1 represents a product that a client bought at least once a…
0
votes
1 answer

How to handle input value error when using undersampling methods from imblearn?

Thank you for your help in advance. I am trying to use the RandomUnderSampler() and fit_sample() methods from imblearn to balance a botnet dataset with two missing values. The dataset contains a label column for binary classification that uses 0 and…
Ghada
0
votes
0 answers

SMOTENC for imbalanced multiclass classification using a pipeline gives NaN values

I am using a dataset with null values and a mix of categorical and continuous data. Initially, I replaced the null values in certain columns and then used SMOTENC in the pipeline with StratifiedKFold. The accuracy and ROC score are always…
0
votes
0 answers

Using Class Weights or Sample Weights for One-Hot Encoded labels with Keras models

I want to use class weights or sample weights to balance my data during model training. My dataset is an image dataset with 20 classes in total. The dataset is highly imbalanced. I have created a data loader that loads multiple-images…
0
votes
0 answers

Using SMOTE for BERT inputs

I have some imbalanced data that I need to classify. I want to use SMOTE to balance it, but I don't really understand how to use it since I have multiple BERT inputs. Do I need to use it for input_ids? Or attention_masks? Or both? Also, a piece of…
atlas
0
votes
1 answer

Oversampling on binary classification

I am doing a binary classification on a huge dataset (190 columns, 500K records). The target values are 0 and 1. However, when I do the oversampling with SMOTE, new target values in the y-vector are created (0, 1, 2 for example). I do not…
0
votes
1 answer

Imbalanced classification with XGBoost in Python: scale_pos_weight not working properly

I am using XGBoost with Python to perform a binary classification in which class 0 appears roughly 9 times more frequently than class 1. I am of course using scale_pos_weight=9. However, when I perform the prediction on the testing…
donut
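The value 9 in this question matches the commonly suggested starting point for `scale_pos_weight`, the ratio of negative to positive training examples. A small sketch of that arithmetic follows; `suggested_scale_pos_weight` is a hypothetical helper name, the labels are made up, and the result is only a starting value that may still need tuning.

```python
from collections import Counter

def suggested_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples (hypothetical helper)."""
    counts = Counter(labels)
    n_pos = counts[positive]
    n_neg = len(labels) - n_pos
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos

# Made-up training labels: class 0 is 9 times as frequent as class 1.
y_train = [0] * 900 + [1] * 100
weight = suggested_scale_pos_weight(y_train)
print(weight)
```

The ratio should be computed from the training labels only, so that nothing about the test split influences the weight.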
0
votes
1 answer

When I use imblearn pipeline instead of sklearn pipeline all textual features disappear. Any solution?

My code is below. I need to use SMOTENC to balance the dataset, which means I have to use the pipeline from the imblearn library. However, it does not recognize the CountVectorizer features: from imblearn.pipeline import Pipeline # from…
0
votes
1 answer

Can I use RandomUnderSampler for categorical data as well?

AFAIK, unlike SMOTE, RandomUnderSampler selects a subset of the data. But I am not confident about using it for categorical data. So, is it really applicable to categorical data?
user14596364
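The distinction the question draws (selecting a subset rather than synthesizing points) can be illustrated with a plain-Python sketch. `random_undersample` is a hypothetical helper written for this example, not imblearn's code, and the rows are made up. Because it only picks existing rows, categorical values pass through unchanged.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class
    (hypothetical helper; no values are interpolated or created)."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(ix) for ix in by_class.values())
    keep = []
    for ix in by_class.values():
        keep.extend(rng.sample(ix, n_min))
    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Made-up rows with purely categorical features.
rows = [
    ("red", "S"), ("blue", "M"), ("green", "L"), ("red", "M"),
    ("blue", "S"), ("green", "S"), ("red", "L"), ("blue", "L"),
    ("green", "M"), ("red", "S"),
]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

sub_rows, sub_labels = random_undersample(rows, labels)
sub_counts = Counter(sub_labels)
```

Every row in `sub_rows` is an unmodified row from `rows`, which is why no numeric encoding of the categories is needed for the selection step itself.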
0
votes
1 answer

Improving classification performance for severely imbalanced data with an abnormally skewed distribution

I have a large dataset D which I balanced using an undersampling method called RandomUnderSampler from the imblearn package, which reduces the majority class. The data have three classes: Yes (1), No (0), Unfinished (2). This is the minimal 3d…
0
votes
2 answers

'BalancedBaggingClassifier' object has no attribute 'n_features_in_'

I am working on an imbalanced multi-class dataset. I am trying to pass it into a BalancedBaggingClassifier, but I keep getting the error below. Code: import pandas as pd dataframe = pd.read_excel('mergedDataset.xlsx') from sklearn import…