Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
2 votes, 1 answer

Python package for SMOTEBoosting algorithm

I am looking for a Python package that implements the SMOTEBoosting algorithm. But I only find SMOTE in imbalanced-learn.
tides
2 votes, 0 answers

Threshold moving to find the best cost for imbalanced dataset classification

I need to confirm my understanding of the threshold moving process to find the best cost of misclassification (binary) for an imbalanced dataset. Split data into train and test. Fit the model on the train data set. Obtain the predicted probabilities for…
Vidya
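The threshold-moving idea in this question can be illustrated with a minimal sketch. It assumes predicted probabilities from a held-out set are already available, and the labels, probabilities, and cost values (`cost_fp`, `cost_fn`) are hypothetical, not taken from the question. It scans a grid of thresholds and keeps the one with the lowest total misclassification cost.

```python
def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Sum the misclassification cost when predicting 1 for prob >= threshold."""
    cost = 0.0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 0:
            cost += cost_fp  # false positive
        elif pred == 0 and truth == 1:
            cost += cost_fn  # false negative
    return cost


def best_threshold(y_true, y_prob, cost_fp=1.0, cost_fn=5.0):
    """Scan thresholds 0.00..1.00 and return (threshold, cost) with minimal cost."""
    grid = [i / 100 for i in range(101)]
    costs = [(total_cost(y_true, y_prob, t, cost_fp, cost_fn), t) for t in grid]
    cost, threshold = min(costs)  # ties resolve to the lowest threshold
    return threshold, cost


# Hypothetical held-out labels and predicted probabilities (assumed values).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.55, 0.7, 0.9]

threshold, cost = best_threshold(y_true, y_prob)
print(threshold, cost)
```

Choosing the threshold on data that is also used for the final evaluation would make the reported cost optimistic, so a separate validation split is the safer assumption here.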
2 votes, 2 answers

'NearMiss' object has no attribute '_validate_data'

This is the code below which shows the error: from imblearn.under_sampling import NearMiss nm = NearMiss() X_res,y_res=nm.fit_sample(X,Y)
Yashraj Jain
2 votes, 0 answers

How to apply Undersampling or oversampling to a dataset in Python?

Here's the thing: I have imbalanced data and I'm trying to use undersampling. Perhaps people don't have the solution to my error, but if this is the case, any alternative would be appreciated. This is what I've done: from imblearn.under_sampling…
Dumb ML
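As one alternative of the kind this question asks for, random undersampling can be sketched without any resampling library. This is a minimal illustration, not the asker's code: the function name `random_undersample` and the toy data are hypothetical, and it simply keeps as many rows from each class as the smallest class has.

```python
import random
from collections import defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for idx, label in enumerate(y):
        indices_by_class[label].append(idx)

    n_min = min(len(idxs) for idxs in indices_by_class.values())

    kept = []
    for idxs in indices_by_class.values():
        kept.extend(rng.sample(idxs, n_min))  # sample without replacement
    kept.sort()  # preserve the original row order

    return [X[i] for i in kept], [y[i] for i in kept]


# Hypothetical toy data: 8 rows of class 0 and 2 rows of class 1.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = random_undersample(X, y)
print(y_res)
```

Undersampling discards majority-class rows, so it is usually applied to the training split only.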
2 votes, 1 answer

How do I apply SMOTENC to my data frame that has columns with objects and numerics?

> In: data.dtypes
> Out:
> Organization Name                          object
> Money Raised Currency (in USD)             float64
> Announced Date                             datetime64[ns]
> Total Funding Amount Currency (in USD) …
2 votes, 1 answer

How to use class weights for GaussianNB and KNeighborsRegressor in sklearn?

I have a highly imbalanced data set from which I want to get both classification (binary) as well as probabilities. I have managed to use logistic regression as well as random forest to obtain results from cross_val_predict using class weights. I…
2 votes, 2 answers

Logistic Regression - class_weight balanced vs dict argument

When using sklearn LogisticRegression function for binary classification of imbalanced training dataset (e.g., 85% pos class vs 15% neg class), is there a difference between setting the class_weight argument to 'balanced' vs setting it to {0:0.15,…
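For context on the 'balanced' option, scikit-learn documents its heuristic as n_samples / (n_classes * class_count). The sketch below reproduces that formula in plain Python for a hypothetical 85/15 label split like the one in the question. The helper name `balanced_class_weights` is an assumption for illustration, not a scikit-learn function.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights from the n_samples / (n_classes * class_count) heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 85 positives (class 1) and 15 negatives (class 0).
y = [1] * 85 + [0] * 15

weights = balanced_class_weights(y)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

A hand-written dict can give the same ratio between the classes but a different overall scale, and that scale interacts with the regularization strength, so the two settings are not guaranteed to fit identical models.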
2 votes, 1 answer

Correct way to do cross validation in a pipeline with imbalanced data

For the given imbalanced data, I have created different pipelines for standardization & one hot encoding: numeric_transformer = Pipeline(steps = [('scaler', StandardScaler())]) categorical_transformer = Pipeline(steps=['ohe',…
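A point that usually matters for this kind of question is that any resampling should touch only the training part of each fold, leaving the held-out part with its natural class ratio. The following is a minimal plain-Python sketch of that idea under stated assumptions: it uses random oversampling as a stand-in technique, the helper names (`stratified_folds`, `oversample`) and labels are hypothetical, and it does not reproduce the asker's pipeline.

```python
import random


def stratified_folds(y, k, seed=0):
    """Split row indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


def oversample(indices, y, rng):
    """Duplicate minority-class indices at random until all classes are equal in size."""
    by_class = {}
    for idx in indices:
        by_class.setdefault(y[idx], []).append(idx)
    n_max = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=n_max - len(idxs)))
    return balanced


# Hypothetical labels: 16 rows of class 0 and 4 rows of class 1.
y = [0] * 16 + [1] * 4
folds = stratified_folds(y, k=4)
rng = random.Random(1)

splits = []
for i, test_idx in enumerate(folds):
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # Resample the training indices only; the test fold is left untouched.
    splits.append((oversample(train_idx, y, rng), test_idx))
```

Each `(train, test)` pair can then be used to fit on the balanced training rows and score on the untouched test rows.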
2 votes, 1 answer

How to fix class imbalance in dialogue (text) time series data?

I have a dataset that looks like this: df.head(5)
   data                                              labels
0  [0.0009808844009380855, 0.0008974465127279559]    1
1  [0.0007158940267629654, 0.0008202958833774329]    3
2  …
connor449
2 votes, 0 answers

Using class weights with sklearn VotingClassifier

I have an imbalanced dataset for a classification problem. My target variable is binary and has two categories. I implemented Random Forest and Logistic Regression by assigning class_weights as a parameter. When I fit data to random forest and logistic…
2 votes, 1 answer

F1-score with imbalanced data

I am working on a binary classification task. My evaluation data is imbalanced and consists of approx. 20% from class1 and 80% from class2. Even though I have good classification accuracy on each class type, 0.602 on class1 and 0.792 on class2, if I calculate…
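To show how per-class F1 behaves on a 20/80 split like the one described, here is a minimal sketch with hypothetical predictions. The counts are assumed for illustration (recall of 0.6 on class 1 and 0.8 on class 2), not the asker's data. It computes F1 for each class from true positives, false positives, and false negatives.

```python
def f1_per_class(y_true, y_pred):
    """Return {class: F1}, using F1 = 2*TP / (2*TP + FP + FN) for each class."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical evaluation set: 20 rows of class 1, 80 rows of class 2.
y_true = [1] * 20 + [2] * 80
# Assumed predictions: 12 of 20 class-1 rows correct, 64 of 80 class-2 rows correct.
y_pred = [1] * 12 + [2] * 8 + [2] * 64 + [1] * 16

scores = f1_per_class(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)
print(scores, macro_f1)
```

In this assumed setup the minority class has low precision, because the 16 misclassified majority rows outnumber its 12 true positives. Its F1 therefore falls below its recall even though per-class accuracy looks reasonable.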
2 votes, 2 answers

PyTorch - how to undersample using WeightedRandomSampler

I have an unbalanced dataset and would like to undersample the class that is overrepresented. How do I go about it? I would like to use WeightedRandomSampler but I am also open to other suggestions. So far I am assuming that my code will have to…
2 votes, 1 answer

Imbalanced-learn: Import Error: cannot import name 'MultiOutputMixin'

I've re-installed the latest scikit-learn and imbalanced-learn. I've also checked all other libraries to make sure they are compatible with imbalanced-learn. I just want to run a simple RandomOverSample(), but I got the following import error…
Cassie.L
2 votes, 1 answer

Deep Learning: Multiclass Classification with same amount of labels between the training dataset and test dataset

I'm writing code for multiclass classification. I have custom datasets with 7 columns (6 features and 1 label); the training dataset has 2 types of label (1 and 2), and the testing dataset has 3 types of labels (1, 2, and 3). The aim of…
2 votes, 2 answers

Balance classes in cross validation

I would like to build a GBM model with H2O. My data set is imbalanced, so I am using the balance_classes parameter. For grid search (parameter tuning) I would like to use 5-fold cross validation. I am wondering how H2O deals with class balancing in…
Coco