Questions tagged [imblearn]

Python imbalanced-learn package. To improve the results or training speed of machine learning algorithms on datasets where one or more classes have significantly fewer or more training examples than the others, you can use an imbalanced-learning approach. Imbalanced-learning methods use re-sampling techniques such as SMOTE, ADASYN, Tomek links, and various combinations of them.

imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used on datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

Most classification algorithms only perform optimally when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more other classes, have proven to be a challenge while at the same time becoming more and more common.

One way of addressing this issue is to re-sample the dataset so as to offset this imbalance, in the hope of arriving at a more robust and fair decision boundary than you would otherwise.

Re-sampling techniques are divided into four categories:

    Under-sampling the majority class(es).
    Over-sampling the minority class.
    Combining over- and under-sampling.
    Creating ensemble balanced sets.
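
The first two categories can be illustrated with a minimal, standard-library-only sketch. This is not imbalanced-learn's implementation: the function name `random_resample`, its `mode` and `seed` parameters, and the toy data are all assumptions made for illustration. It simply duplicates random minority samples (over-sampling) or keeps a random subset of each class (under-sampling) until the class counts match.

```python
import random
from collections import Counter, defaultdict


def random_resample(X, y, mode="over", seed=0):
    """Balance classes by random re-sampling (hypothetical helper, sketch only).

    mode="over":  add random duplicates (drawn with replacement) to each
                  smaller class until it matches the largest class.
    mode="under": keep a random subset (drawn without replacement) of each
                  class, sized to the smallest class.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    sizes = [len(rows) for rows in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)

    X_out, y_out = [], []
    for label, rows in by_class.items():
        if mode == "over":
            # Keep all originals, then pad with random duplicates.
            picked = rows + [rng.choice(rows) for _ in range(target - len(rows))]
        else:
            # Shrink the class to the target size.
            picked = rng.sample(rows, target)
        X_out.extend(picked)
        y_out.extend([label] * len(picked))
    return X_out, y_out


# Toy data (assumed): 9 samples of class 0 and 3 samples of class 1.
X = [[i, i * 2] for i in range(12)]
y = [0] * 9 + [1] * 3

X_over, y_over = random_resample(X, y, mode="over")
X_under, y_under = random_resample(X, y, mode="under")
print(Counter(y_over))   # both classes now have 9 samples
print(Counter(y_under))  # both classes now have 3 samples
```

In practice the package's own samplers would normally be used instead of hand-written code like this; the sketch only shows what "offsetting the imbalance" means at the level of class counts.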

Below is a list of the methods currently implemented in this module.

Under-sampling

  1. Random majority under-sampling with replacement
  2. Extraction of majority-minority Tomek links
  3. Under-sampling with Cluster Centroids
  4. NearMiss-(1 & 2 & 3)
  5. Condensed Nearest Neighbour
  6. One-Sided Selection
  7. Neighbourhood Cleaning Rule
  8. Edited Nearest Neighbours
  9. Instance Hardness Threshold
  10. Repeated Edited Nearest Neighbours
  11. AllKNN

Over-sampling

  12. Random minority over-sampling with replacement
  13. SMOTE - Synthetic Minority Over-sampling Technique
  14. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2
  15. SVM SMOTE - Support Vectors SMOTE
  16. ADASYN - Adaptive synthetic sampling approach for imbalanced learning

Over-sampling followed by under-sampling

    • SMOTE + Tomek links
    • SMOTE + ENN

Ensemble classifier using samplers internally

    • EasyEnsemble
    • BalanceCascade
    • Balanced Random Forest
    • Balanced Bagging
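
Several of the over-sampling methods listed above build on SMOTE, whose core idea is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a simplified sketch of that idea using only the standard library. It is not the package's implementation, and the function name `smote_sketch`, its parameters, and the toy points are assumptions for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from minority-class samples.

    Simplified SMOTE-style interpolation (hypothetical helper):
    pick a random minority sample, pick one of its k nearest minority
    neighbours, and place a new point somewhere on the segment between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours are all other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority-class points (assumed data).
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Because every synthetic point lies between two existing minority samples, the method needs at least two samples in the class being over-sampled. In this sketch a class with a single sample would leave no neighbour to interpolate towards.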

205 questions
3
votes
2 answers

I am trying to make my data balanced as my target variable has multi-class and I want to oversample it to make my data balanced

Let x contain the variables: print(x) Restaurant Cuisines Average_Cost Rating Votes Reviews Area 0 3.526361 0.693147 5.303305 1.504077 2.564949 1.609438 7.214504 1 1.386294 4.127134 4.615121 …
3
votes
1 answer

Output of shape for training after oversampling with imbalanced-learn

I am using imbalanced-learn to oversample my data. I want to know how many entries there are in each class after using the oversampling method. This code works nicely: from imblearn.over_sampling import SMOTE from collections import Counter def…
Christoph H.
  • 173
  • 1
  • 14
3
votes
3 answers

resampling data - using SMOTE from imblearn with 3D numpy arrays

I want to resample my dataset. It consists of transformed categorical data with labels from 3 classes. The number of samples per class is: class A: 6945, class B: 650, class C: 9066, total samples: 16661. The data shape…
sanchezjAI
  • 91
  • 1
  • 9
3
votes
1 answer

Python oversampling combine several samplers in a pipeline

My issue concerns the ValueError raised by the SMOTE class: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6 # imbalanced learn is a package containing implementation of SMOTE from imblearn.over_sampling import SMOTE, ADASYN,…
3
votes
0 answers

How to use (imblearn.keras import BalancedBatchGenerator) with an X_train array of more than two dimensions?

I'm building a CNN model trained on an imbalanced dataset using Keras. I'm working on data re-sampling using imblearn.keras.balanced_batch_generator provided by imblearn. My x_train array is of shape (n_samples, 32, 32, 1) while fit_generator for…
user2340286
  • 69
  • 1
  • 6
3
votes
4 answers

How to use SMOTENC inside pipeline (Error: Some of the categorical indices are out of range)?

I would greatly appreciate it if you could let me know how to use SMOTENC. I wrote: # Data XX = pd.read_csv('Financial Distress.csv') y = np.array(XX['Financial Distress'].values.tolist()) y = np.array([0 if i > -0.50 else 1 for i in y]) Na =…
ebrahimi
  • 912
  • 2
  • 13
  • 32
3
votes
3 answers

Using imblearn for oversampling multi class data

I want to use the RandomOverSampler class from the imbalanced-learn module to oversample data with more than two classes. The following is my code with 3 classes: import numpy as np from imblearn.over_sampling import RandomOverSampler data…
starrr
  • 1,013
  • 1
  • 17
  • 48
2
votes
0 answers

How to implement undersampling techniques like NearMiss, TomekLinks, ClusterCentroids, ENN using PySpark?

I'm trying to work on a fraud detection dataset from Kaggle (Credit Card Transactions Fraud Detection Dataset). I'm working in PySpark and wish to apply undersampling techniques with it. However, I can't find any articles or documentation that…
2
votes
1 answer

Using pipeline, SMOTE, and GridSearchCV together

I wrote this code: LR=LogisticRegression() pipe_lr= Pipeline ([ ('oversampling', SMOTE()), ('LR', LR) ]) C_list_lr=[0.001, 0.01, 0.1, 1, 10, 100 ] solver_list_lr=[ 'liblinear', 'newton-cg', 'saga'] penalty_list_lr=[None, 'elasticnet',…
2
votes
1 answer

Importing SMOTE raises AttributeError: module 'sklearn.metrics._dist_metrics' has no attribute 'DistanceMetric32'

Running from imblearn.over_sampling import SMOTE raises the following error. --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) d:\A\OneDrive -…
Leo
  • 93
  • 1
  • 8
2
votes
1 answer

ValueError: Found array with dim 3. Estimator expected <= 2 during RandomUndersampling

For one of my datasets, I have a data imbalance problem, as the minority class has very few samples compared to the majority class. So I want to balance the data by undersampling the majority class. When I try to use RandomUnderSampler from…
upendra
  • 2,141
  • 9
  • 39
  • 64
2
votes
3 answers

AttributeError: module 'sklearn.metrics._dist_metrics' has no attribute 'DatasetsPair'

I'm trying to balance my data in a Jupyter notebook, using SMOTE: from imblearn import over_sampling from imblearn.over_sampling import SMOTE balanced = SMOTE() x_balanced , y_balanced = balanced.fit_resample(X_train,y_train) but I'm getting the…
omerk
  • 23
  • 1
  • 4
2
votes
0 answers

Complex nesting within imblearn pipelines

I have been trying to find a solution to this but unsuccessfully so far. I am working with some data for which I need to adopt a resampling procedure within a (scikit-learn/imblearn) pipeline, meaning that the size of both the samples and targets…
DrMaga
  • 21
  • 1
2
votes
0 answers

Nested pipelines in Imbalanced-Learn

This minimal code works fine for scikit-learn pipeline : inner_pipe = Pipeline(steps=[('scaler', StandardScaler())]) my_pipeline = Pipeline( steps=[('pre_step', inner_pipe), ('rfc', RandomForestClassifier())]) but if the used…
abdelgha4
  • 351
  • 1
  • 16
2
votes
0 answers

Balanced batch generator returns inconsistent class number

I am using imblearn.keras.balanced_batch_generator in my CNN classification task. But the generator produces inconsistent classes for my data (I have 12 classes in total but it produces 10/11/12 classes when batches are yielded). This is causing an…