Questions tagged [imblearn]

Python imbalanced-learn package. When one or more classes in a dataset has significantly fewer (or more) training examples than the others, an imbalanced-learning approach can improve the results or the speed of the learning process of machine learning algorithms. Imbalanced-learning methods use re-sampling techniques such as SMOTE, ADASYN, Tomek links, and their various combinations.

imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.
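For illustration, here is a minimal sketch of the shared sampler API; the toy dataset built with scikit-learn's make_classification is an assumption for the example, not part of the package description:

    # Minimal sketch of the common sampler API; the toy dataset is illustrative.
    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # A 90/10 two-class dataset stands in for a real imbalanced problem.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print(Counter(y))                     # roughly {0: 900, 1: 100}

    # Every sampler exposes fit_resample, mirroring scikit-learn's fit/transform.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print(Counter(y_res))                 # classes are balanced after resampling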

Most classification algorithms perform optimally only when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more majority classes, have proven to be a challenge while at the same time becoming more and more common.

One way of addressing this issue is to re-sample the dataset so as to offset the imbalance, in the hope of arriving at a more robust and fair decision boundary than you would otherwise.

Re-sampling techniques fall into four categories:

    1. Under-sampling the majority class(es).
    2. Over-sampling the minority class.
    3. Combining over- and under-sampling.
    4. Creating ensemble balanced sets.

Below is a list of the methods currently implemented in this module; a short usage sketch of the combined approach follows the list.

Under-sampling

  1. Random majority under-sampling with replacement
  2. Extraction of majority-minority Tomek links
  3. Under-sampling with Cluster Centroids
  4. NearMiss-(1 & 2 & 3)
  5. Condensed Nearest Neighbour
  6. One-Sided Selection
  7. Neighbourhood Cleaning Rule
  8. Edited Nearest Neighbours
  9. Instance Hardness Threshold
  10. Repeated Edited Nearest Neighbours
  11. AllKNN

Over-sampling

  1. Random minority over-sampling with replacement
  2. SMOTE - Synthetic Minority Over-sampling Technique
  3. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2
  4. SVM SMOTE - Support Vectors SMOTE
  5. ADASYN - Adaptive synthetic sampling approach for imbalanced learning

Over-sampling followed by under-sampling

    • SMOTE + Tomek links
    • SMOTE + ENN

Ensemble classifier using samplers internally

    • EasyEnsemble
    • BalanceCascade
    • Balanced Random Forest
    • Balanced Bagging

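As a concrete example of the third category (over-sampling followed by under-sampling), here is a minimal sketch using the combined SMOTETomek sampler; the toy dataset is a stand-in:

    # SMOTETomek first over-samples the minority class with SMOTE, then
    # removes Tomek links to clean up the class boundary.
    from sklearn.datasets import make_classification
    from imblearn.combine import SMOTETomek

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)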
Resources:

    • GitHub repository: https://github.com/scikit-learn-contrib/imbalanced-learn
    • Documentation: https://imbalanced-learn.org

205 questions
2 votes, 1 answer

Use imblearn to plot a ROC curve

I'm trying to use imblearn to plot a ROC curve but run into some problems. Here's a screenshot of my data. from imblearn.over_sampling import SMOTE, ADASYN from collections import Counter import pandas as pd import numpy as np import…
yihao ren
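A minimal sketch of one way to approach this, assuming the usual pattern of resampling only the training split; the classifier and data are placeholders, not the asker's:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import RocCurveDisplay
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Resample the training data only; evaluate on the untouched test set.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)

    RocCurveDisplay.from_estimator(clf, X_test, y_test)
    plt.show()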
2 votes, 1 answer

Using VotingClassifier with other classifiers inside a Sklearn Pipeline

I want to use the VotingClassifier inside a sklearn Pipeline, where I defined a set of classifiers. I got some intuition from this question: Using VotingClassifier in Sklearn Pipeline to build the code below, but in this question each of the…
Minions
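A minimal sketch of the pattern, assuming an imblearn pipeline with placeholder estimators:

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    # The VotingClassifier is simply the final estimator of the pipeline.
    voting = VotingClassifier(
        estimators=[('lr', LogisticRegression(max_iter=1000)),
                    ('rf', RandomForestClassifier(random_state=0))],
        voting='soft')

    pipe = Pipeline(steps=[('scaler', StandardScaler()),
                           ('smote', SMOTE(random_state=0)),
                           ('voting', voting)])
    # pipe.fit(X_train, y_train); pipe.predict(X_test)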
2 votes, 2 answers

How to use Random Undersampler with ratio = 'dict' in imblearn?

I am trying to deal with imbalanced data set using imblearn's random under-sampler. I want to specify the number of labels to be under-sampled manually. Here is my code: sm = RandomUnderSampler(ratio = {0:142498, 1: 495}, random_state=42) X_train,…
Saurav--
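For reference, the ratio parameter was renamed sampling_strategy in imbalanced-learn 0.4; a minimal sketch using the class counts from the question's code:

    from imblearn.under_sampling import RandomUnderSampler

    # Keep 142498 samples of class 0 and 495 of class 1 after resampling.
    rus = RandomUnderSampler(sampling_strategy={0: 142498, 1: 495}, random_state=42)
    # X_res, y_res = rus.fit_resample(X_train, y_train)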
2 votes, 0 answers

Undersampling for multilabel imbalanced datasets in pandas

I'm working on a roll-your-own undersampling function, since imblearn does not work neatly with multi-label classification (e.g. it only accepts one dimensional y). I want to iterate through X and y, removing a row every 2 or 3 rows that are part…
tw0000
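A minimal sketch of one hand-rolled approach, assuming the rule is "drop all but every k-th row among the majority rows"; the majority mask and k are assumptions standing in for the question's truncated criterion:

    import numpy as np
    import pandas as pd

    def undersample_multilabel(X: pd.DataFrame, y: pd.DataFrame,
                               majority_mask: np.ndarray, keep_every: int = 2):
        """Drop all but every keep_every-th row among the masked (majority) rows."""
        majority_idx = np.flatnonzero(majority_mask)
        drop_idx = majority_idx[np.arange(len(majority_idx)) % keep_every != 0]
        keep = np.setdiff1d(np.arange(len(X)), drop_idx)
        return X.iloc[keep], y.iloc[keep]

    # Example: treat rows whose label vector is all zeros as the majority.
    # mask = (y == 0).all(axis=1).to_numpy()
    # X_res, y_res = undersample_multilabel(X, y, mask)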
1 vote, 0 answers

ImageDataGenerator.flow_from_dataframe still has problems with Overfitting

I have an image dataset of 2432 images, each belonging to one of a total of 3 categories. The labels are stored in a CSV file with the image id and the label (T1). The distribution of the data is: negative 1695, positive 648, neutral 89. I'm trying to…
1 vote, 1 answer

Mitigation for imblearn pipelines

I'm trying to mitigate unfairness for a model I trained using an imblearn pipeline with ADASYN. My pipeline looks like this: loaded_model = Pipeline(steps=[('feature_scaler', StandardScaler()), ('adasyn_resampling',…
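A minimal sketch completing a pipeline of that shape; only the two named steps come from the question, and the final estimator is a placeholder:

    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import ADASYN
    from imblearn.pipeline import Pipeline

    loaded_model = Pipeline(steps=[
        ('feature_scaler', StandardScaler()),
        ('adasyn_resampling', ADASYN(random_state=42)),
        ('classifier', LogisticRegression(max_iter=1000)),  # placeholder estimator
    ])
    # Resampling runs only during fit; predict and score skip the sampler.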
1 vote, 1 answer

Imblearn pipeline with SMOTE step - AttributeError: This 'Pipeline' has no attribute 'transform'

As part of an assignment, I have been trying to whip up a pipeline to preprocess some data that I have. Said data has a total of five classes, one of which is imbalanced compared to the others, and therefore I decided to apply SMOTE for it. The code…
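The usual cause: samplers implement fit_resample rather than transform, so they cannot sit inside sklearn.pipeline.Pipeline; imblearn ships its own Pipeline that accepts them. A minimal sketch, with a placeholder estimator:

    from sklearn.tree import DecisionTreeClassifier
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline  # note: imblearn's Pipeline, not sklearn's

    pipe = Pipeline(steps=[('smote', SMOTE(random_state=0)),
                           ('clf', DecisionTreeClassifier(random_state=0))])
    # pipe.fit(X_train, y_train) applies SMOTE during fit only.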
1 vote, 0 answers

Running sampling in scikit-learn with imblearn in parallel

I just noticed that the over-/under-sampler methods from the imbalanced-learn (imblearn) package now give a future deprecation warning for running in parallel (the n_jobs=x argument): FutureWarning: The parameter n_jobs has been deprecated in 0.10 and…
Björn
1 vote, 1 answer

Why does installing imblearn with pip fail?

I am trying to install the Python package "imblearn" to balance datasets, with the command pip install imblearn, but it keeps failing. I tried from cmd and from PowerShell with admin privileges, with the regular pip command, and with git clone to the…
Ron Keinan
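A point worth checking first for errors like this: the package's canonical name on PyPI is imbalanced-learn (imblearn is only the import name), so pip install imbalanced-learn is the documented installation command.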
1 vote, 2 answers

StratifiedKFold and Over-Sampling together

I have a machine learning model and a dataset with 15 features about breast cancer. I want to predict the status of a person (alive or dead). I have 85% alive cases and only 15% dead. So, I want to use over-sampling for dealing with this problem and…
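A minimal sketch of the standard pattern, with placeholder data and estimator: put the over-sampler inside an imblearn pipeline so each StratifiedKFold training split is resampled independently while the validation split stays untouched:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    # SMOTE runs inside each fold's fit, so validation data is never resampled.
    pipe = Pipeline(steps=[('smote', SMOTE(random_state=0)),
                           ('clf', LogisticRegression(max_iter=1000))])
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    # scores = cross_val_score(pipe, X, y, cv=cv, scoring='f1')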
1 vote, 1 answer

Is there a parameter for GridSearchCV to select the best with the lowest difference between train and test set?

My goal is to get a well-fitted model (train and test set metric differences of only 1%-5%), because the Random Forest tends to overfit (with default params, the train set f1 score for class 1 is 1.0). The problem is that GridSearchCV only considers…
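There is no built-in parameter for this, but GridSearchCV's refit argument accepts a callable that picks the winning index from cv_results_, which allows ranking by the train/validation gap. A minimal sketch; the 5-point threshold and the 'f1' scorer are assumptions, and return_train_score must be enabled:

    import numpy as np
    from sklearn.model_selection import GridSearchCV

    def smallest_gap(cv_results):
        """Best validation score among candidates with a small train/validation gap."""
        train = np.asarray(cv_results['mean_train_score'])
        test = np.asarray(cv_results['mean_test_score'])
        ok = np.flatnonzero(train - test <= 0.05)   # gap of at most 5 points
        candidates = ok if ok.size else np.arange(len(test))
        return int(candidates[np.argmax(test[candidates])])

    # search = GridSearchCV(estimator, param_grid, scoring='f1',
    #                       return_train_score=True, refit=smallest_gap)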
1 vote, 0 answers

SHAP with an imblearn pipeline

How can I use SHAP after using imblearn pipeline? This is my code: pipeline_adaboost = Pipeline([('smt', SMOTE(random_state=42)), ('adaboost', AdaBoostClassifier(random_state=42))]) adaboost_parameters =…
new_data
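One workable pattern, sketched under assumptions (the step names come from the question; the data variables are placeholders): fit the imblearn pipeline, then explain the fitted final estimator directly, since the SMOTE step only acts during fit and does not change the feature space at prediction time:

    import shap
    from sklearn.ensemble import AdaBoostClassifier
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    pipeline_adaboost = Pipeline([('smt', SMOTE(random_state=42)),
                                  ('adaboost', AdaBoostClassifier(random_state=42))])
    # pipeline_adaboost.fit(X_train, y_train)

    # Explain the fitted classifier; SMOTE is irrelevant at prediction time.
    # explainer = shap.Explainer(
    #     pipeline_adaboost.named_steps['adaboost'].predict_proba, X_train)
    # shap_values = explainer(X_test)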
1 vote, 1 answer

.fit : AttributeError in python3, using imblearn.ensemble and BalancedRandomForestClassifier

CODE: from imblearn.ensemble import BalancedRandomForestClassifier bal_forest = BalancedRandomForestClassifier(n_estimators=100, random_state=1) bal_forest.fit(X_train,…
1 vote, 0 answers

Outlier elimination in a imblearn pipeline affecting both X and y

I aim to integrate outlier elimination into a machine learning pipeline with a continuous dependent variable. The challenge is to keep X and y the same length, so I have to eliminate outliers in both datasets. As this task turned out to be…
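A minimal sketch using imblearn's FunctionSampler, which lets an arbitrary function drop rows from X and y together inside a pipeline; the IsolationForest filter is an assumption, not the asker's method, and validate=False permits a continuous y:

    from sklearn.ensemble import IsolationForest
    from imblearn import FunctionSampler

    def drop_outliers(X, y):
        # Keep only the rows that IsolationForest labels as inliers (+1).
        keep = IsolationForest(random_state=0).fit_predict(X) == 1
        return X[keep], y[keep]

    sampler = FunctionSampler(func=drop_outliers, validate=False)
    # Usable as a step in imblearn.pipeline.Pipeline; y is filtered with X.
    # X_res, y_res = sampler.fit_resample(X, y)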
1 vote, 1 answer

Performing Random Under-sampling after SMOTE using imblearn

I am trying to implement combining over-sampling and under-sampling using RandomUnderSampler() and SMOTE(). I am working on the loan_status dataset. I have done the following split. X = df.drop(['Loan_Status'],axis=1).values # independant…
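A minimal sketch of that combination, assuming binary labels and illustrative sampling_strategy values; the estimator is a placeholder:

    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.pipeline import Pipeline

    pipe = Pipeline(steps=[
        # Over-sample the minority class up to half the majority size...
        ('over', SMOTE(sampling_strategy=0.5, random_state=42)),
        # ...then shrink the majority until minority/majority reaches 0.8.
        ('under', RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
        ('clf', LogisticRegression(max_iter=1000)),
    ])
    # pipe.fit(X_train, y_train)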