Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
0 votes, 0 answers

Precision significantly drops when using entire dataset to test a classifier trained on undersampled data

I'm doing the Kaggle Credit Card Fraud Detection. There is a significant imbalance between Class = 1 (fraudulent transaction) and Class = 0 (not fraudulent). To compensate, I undersampled the data so that there was a 1:1 ratio between fraudulent…
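One likely reason for the drop described in this question is that precision depends on class prevalence even when the classifier's true-positive and false-positive rates stay the same. The sketch below works through that arithmetic. The operating point (TPR, FPR) and the fraud share of the full dataset are hypothetical values chosen for illustration, not figures from the question.

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Expected precision for fixed TPR/FPR at a given positive-class share."""
    true_pos = tpr * prevalence            # share of all cases that are true positives
    false_pos = fpr * (1.0 - prevalence)   # share of all cases that are false positives
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point (assumption, not from the question).
TPR, FPR = 0.90, 0.05

# 1:1 undersampled evaluation set versus an assumed 0.2% fraud share in the full data.
balanced = precision_at_prevalence(TPR, FPR, prevalence=0.5)
original = precision_at_prevalence(TPR, FPR, prevalence=0.002)

print(f"precision on balanced data: {balanced:.3f}")
print(f"precision on full data:     {original:.3f}")
```

With the same error rates, the many more negatives in the full dataset produce many more false positives, so precision falls sharply while recall is unaffected.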
0 votes, 0 answers

How to do an evaluation of Logistic Regression with imbalanced dataset using sklearn?

I am fitting a logistic regression model using Python's scikit-learn. I have an imbalanced dataset with 2/3 of the data points having label y=0 and 1/3 having label y=1. I do a stratified split: X_train, X_test, y_train, y_test = train_test_split(X, y,…
0 votes, 1 answer

Dealing with imbalanced classification data?

I am building a predictive model to predict whether a client will subscribe again or not. I already have the dataset, and the problem is that it is imbalanced (there are more NOs than YESs). I believe that my model is biased, but when I check…
Yassire • 15 • 4
0 votes, 1 answer

Text classification with imbalanced data

I am trying to classify 10,000 samples of text into 20 classes. Four of the classes have just 1 sample each. I tried SMOTE to address this imbalance, but I am unable to generate new samples for classes that have only one record, though I could generate…
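SMOTE builds each synthetic point by interpolating between a minority sample and one of its nearest neighbours from the same class, which is why a class with a single record gives it nothing to interpolate towards. The following is a simplified sketch of that idea in plain Python. The function name `smote_like_samples` and its parameters are hypothetical, and this is not imbalanced-learn's implementation.

```python
import random


def smote_like_samples(points, n_new, k=1, seed=0):
    """Interpolate between a sample and one of its k nearest same-class neighbours."""
    if len(points) < 2:
        # With a single record there is no neighbour to interpolate towards.
        raise ValueError("need at least 2 samples in the class to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Rank the other points of the class by squared Euclidean distance to base.
        others = [p for p in points if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Two records are enough to interpolate; one record raises ValueError.
print(smote_like_samples([(0.0, 0.0), (2.0, 2.0)], n_new=3))
```

Plain random oversampling, which only duplicates existing records, does not need neighbours and so does not hit this limit, although it adds no new information.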
0 votes, 1 answer

Is it feasible to have the training set < the test set after undersampling the majority class?

I have a data set of 1500 records with two classes which are imbalanced. Class 0 has 1300 records while Class 1 has 200 records, hence a ratio of around 6.5:1. I built a random forest with this data set for classification. I know from past experience, if…
0 votes, 0 answers

Predict with scaled test data or not?

I have an imbalanced classification problem. First, I want to scale the data, then resample it with SMOTE. To prevent data leakage I used a pipeline. My code is: X_train, X_test, Y_train, y_test = train_test_split(X, y, test_size = 0.20,…
0 votes, 1 answer

Imbalanced-learn doesn't work even though it has been installed

This is odd. I'm using Python 3.7, and the dependencies of imbalanced-learn are satisfied too. However, when I import the library in Jupyter, it produces an error. Can anyone please advise? --> 13 from imblearn import FunctionSampler 14…
0 votes, 0 answers

Imbalanced Class Learning

I'm dealing with an imbalanced classification problem in which the ratio of class 0 to class 1 is 717.26:1. I tried many models, of which I found GBM working best for my case. Then I came across a research paper and an article to deal with…
0 votes, 1 answer

How to handle an imbalanced dataset and outliers in Python?

I have 2 questions: If we have a classification problem with a dataframe that has a large number of features (columns > 100), and say 20 or 30 of them are highly correlated and the target column (y) is very skewed towards one class, should we first…
Ajay Alex • 473 • 4 • 7
0 votes, 1 answer

F-Score difference between cross_val_score and StratifiedKFold

I want to use a Random Forest Classifier on imbalanced data where X is a np.array representing the features and y is a np.array representing the labels (labels with 90% 0-values, and 10% 1-values). As I was not sure how to do stratification within…
0 votes, 1 answer

Imbalanced dataset with Keras deep learning

I have a dataset that looks like this: Training (Class 0: 471, Class 1: 986), Testing (Class 0: 177, Class 1: 246). I split my data as 80% for training and 20% for validation. I know that it is an imbalanced dataset, and I have tried class_weight but…
0 votes, 2 answers

SMOTE-NC in R. No packages found

I have a dataset with 5 nominal and 37 categorical variables. I want to perform oversampling in R. However, with SMOTE, I cannot do so. I looked for SMOTE-NC as advised by (Chawla, Bowyer and Hall, 2002), but I could not find any package supporting…
0 votes, 1 answer

Using SMOTEENN in GridSearchCV Pipeline with Preprocessing

I am working on a classification problem with a highly imbalanced dataset. I am trying to use SMOTEENN in the grid search pipeline, however I keep getting this ValueError: ValueError: Invalid parameter randomforestclassifier for estimator…
0 votes, 1 answer

Adjust predicted probability after SMOTE

I have an imbalanced data set, and I used SMOTE to oversample the minority class and undersample the majority class. Now I want to check the test AUC using predict_proba of the model. I have two questions: 1. Do I have to correct the probability if I…
anat • 705 • 2 • 7 • 20
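A common way to adjust probabilities from a model trained on resampled data is a prior-shift correction on the odds. The sketch below assumes that resampling changed only the class proportions and not the distribution of features within each class, which holds only approximately for SMOTE's synthetic points. The function name `correct_probability` and the example priors are hypothetical.

```python
def correct_probability(p_resampled, train_prior, true_prior):
    """Rescale a positive-class probability from the resampled prior to the true prior."""
    if p_resampled >= 1.0:
        return 1.0  # avoid dividing by zero when the model is fully certain
    odds = p_resampled / (1.0 - p_resampled)
    # Ratio of true-prior odds to the odds the model saw during training.
    ratio = (true_prior / (1.0 - true_prior)) / (train_prior / (1.0 - train_prior))
    corrected_odds = odds * ratio
    return corrected_odds / (1.0 + corrected_odds)


# Assumed example: model trained on a 50/50 resampled set, true positive share 10%.
scores = [0.2, 0.5, 0.9]
corrected = [correct_probability(p, train_prior=0.5, true_prior=0.1) for p in scores]
print(corrected)
```

Because the correction is strictly increasing in the input probability, it leaves the ranking of the scores unchanged, so AUC computed on the test set is the same before and after the adjustment. The correction matters for calibration and for threshold-based metrics.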
0 votes, 1 answer

What is the set of negative data points for each classifier when using OneVsRest classification in scikit-learn?

I am trying to train a OneVsAll multiclass logistic regression model using sklearn.linear_model.LogisticRegression(multi_class='ovr'). My dataset has over 1000 classes and 2 million training examples. From what I understood, this method will…