Questions tagged [imblearn]

Python imbalanced-learn package. When one or more classes in a dataset have significantly fewer or more training examples than the others, an imbalanced-learning approach can improve the results or the training speed of machine learning algorithms. Imbalanced-learning methods use re-sampling techniques such as SMOTE, ADASYN, Tomek links, and their various combinations.

imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used on datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

Most classification algorithms will only perform optimally when the number of samples of each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more other classes, have proven to be a challenge while at the same time becoming more and more common.

One way of addressing this issue is to re-sample the dataset so as to offset this imbalance, in the hope of arriving at a more robust and fair decision boundary than you would otherwise obtain.
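
To make the idea of re-sampling concrete, here is a minimal standard-library sketch of random over-sampling with replacement. The helper name `random_oversample` and the toy data are assumptions made for illustration; this is not how imbalanced-learn implements it internally.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Rows that belong to this class in the original data.
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw the shortfall with replacement from those rows.
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): five samples of class 0, one of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have the same number of samples
```

The original rows are kept untouched at the front of the result; only duplicates of minority rows are appended.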

Re-sampling techniques are divided into four categories:

    Under-sampling the majority class(es).
    Over-sampling the minority class.
    Combining over- and under-sampling.
    Creating ensemble balanced sets.

Below is a list of the methods currently implemented in this module.

Under-sampling

1. Random majority under-sampling with replacement
2. Extraction of majority-minority Tomek links
3. Under-sampling with Cluster Centroids
4. NearMiss-(1 & 2 & 3)
5. Condensed Nearest Neighbour
6. One-Sided Selection
7. Neighbourhood Cleaning Rule
8. Edited Nearest Neighbours
9. Instance Hardness Threshold
10. Repeated Edited Nearest Neighbours
11. AllKNN

Over-sampling

12. Random minority over-sampling with replacement
13. SMOTE - Synthetic Minority Over-sampling Technique
14. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2
15. SVM SMOTE - Support Vectors SMOTE
16. ADASYN - Adaptive synthetic sampling approach for imbalanced learning

Over-sampling followed by under-sampling
- SMOTE + Tomek links
- SMOTE + ENN

Ensemble classifier using samplers internally
- EasyEnsemble
- BalanceCascade
- Balanced Random Forest
- Balanced Bagging


205 questions

votes

2 answers

I am trying to make my data balanced as my target variable has multi-class and I want to oversample it to make my data balanced

Let x contain the variables: print(x) Restaurant Cuisines Average_Cost Rating Votes Reviews Area 0 3.526361 0.693147 5.303305 1.504077 2.564949 1.609438 7.214504 1 1.386294 4.127134 4.615121 …

asked Nov 15 '19 at 07:17

Karndeep Singh

votes

1 answer

Output of shape for training after oversampling with imbalanced-learn

I am using imbalanced-learn to oversample my data. I want to know how many entries in each class there are after using the oversampling method. This code works nicely: from imblearn.over_sampling import SMOTE from collections import Counter def…

python python-3.x scikit-learn oversampling imblearn

asked Jul 02 '19 at 15:14

Christoph H.

votes

3 answers

resampling data - using SMOTE from imblearn with 3D numpy arrays

I want to resample my dataset. It consists of categorical transformed data with labels for 3 classes. The number of samples per class is: counts of class A: 6945 counts of class B: 650 counts of class C: 9066 Total samples: 16661 The data shape…

python numpy imblearn

asked May 14 '19 at 07:48

sanchezjAI

votes

1 answer

Python oversampling combine several samplers in a pipeline

My issue concerns the ValueError raised by the SMOTE class. Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6 # imbalanced learn is a package containing implementation of SMOTE from imblearn.over_sampling import SMOTE, ADASYN,…

python machine-learning scikit-learn oversampling imblearn

asked May 07 '19 at 16:15

Alibek Jakupov

votes

0 answers

How to use BalancedBatchGenerator (from imblearn.keras) with an X_train array of more than two dimensions?

I'm building a CNN model trained on an imbalanced dataset using Keras. I'm working on data re-sampling using imblearn.keras.balanced_batch_generator provided by imblearn. My x_train array is of shape (n_samples, 32, 32, 1) while fit_generator for…

python keras downsampling imblearn

asked Apr 16 '19 at 03:51

user2340286

votes

4 answers

How to use SMOTENC inside pipeline (Error: Some of the categorical indices are out of range)?

I would greatly appreciate it if you could let me know how to use SMOTENC. I wrote: # Data XX = pd.read_csv('Financial Distress.csv') y = np.array(XX['Financial Distress'].values.tolist()) y = np.array([0 if i > -0.50 else 1 for i in y]) Na =…

python python-3.x scikit-learn valueerror imblearn

asked Jan 24 '19 at 08:47

ebrahimi

votes

3 answers

Using imblearn for oversampling multi class data

I want to use the RandomOverSampler class from the imbalanced-learn module to oversample data with more than two classes. The following is my code with 3 classes: import numpy as np from imblearn.over_sampling import RandomOverSampler data…

python scikit-learn imblearn

asked Aug 06 '17 at 00:06

starrr

votes

0 answers

How to implement undersampling techniques like NearMiss, TomekLinks, ClusterCentroids, ENN using PySpark?

I'm trying to work on a fraud detection dataset from Kaggle, the Credit Card Transactions Fraud Detection Dataset. I'm working in PySpark and wish to apply under-sampling techniques using PySpark. However, I can't find any articles or documentation that…

apache-spark pyspark apache-spark-ml imbalanced-data imblearn

asked Apr 28 '23 at 13:07

Sumit

votes

1 answer

Using pipeline, SMOTE, and GridSearchCV together

I wrote this code: LR=LogisticRegression() pipe_lr= Pipeline ([ ('oversampling', SMOTE()), ('LR', LR) ]) C_list_lr=[0.001, 0.01, 0.1, 1, 10, 100 ] solver_list_lr=[ 'liblinear', 'newton-cg', 'saga'] penalty_list_lr=[None, 'elasticnet',…

machine-learning scikit-learn logistic-regression gridsearchcv imblearn

asked Jan 01 '23 at 19:02

Deniss

votes

1 answer

Importing SMOTE raises AttributeError: module 'sklearn.metrics._dist_metrics' has no attribute 'DistanceMetric32'

Running from imblearn.over_sampling import SMOTE raises the following error. --------------------------------------------------------------------------- AttributeError Traceback (most recent call last) d:\A\OneDrive -…

python imblearn

asked Dec 08 '22 at 20:54

Leo

votes

1 answer

ValueError: Found array with dim 3. Estimator expected <= 2 during RandomUndersampling

For one of my datasets, I have a data imbalance problem as the minority class has very few samples compared to the majority class. So I want to balance the data by undersampling the majority class. When I am trying to use RandomUnderSampler from…

python-3.x numpy imblearn

asked Nov 01 '22 at 01:51

upendra

votes

3 answers

AttributeError: module 'sklearn.metrics._dist_metrics' has no attribute 'DatasetsPair'

I'm trying to balance my data in a Jupyter notebook, using SMOTE: from imblearn import over_sampling from imblearn.over_sampling import SMOTE balanced = SMOTE() x_balanced , y_balanced = balanced.fit_resample(X_train,y_train) but I'm getting the…

python scikit-learn jupyter imblearn smote

asked May 24 '22 at 16:43

omerk

votes

0 answers

Complex nesting within imblearn pipelines

I have been trying to find a solution to this but unsuccessfully so far. I am working with some data for which I need to adopt a resampling procedure within a (scikit-learn/imblearn) pipeline, meaning that the size of both the samples and targets…

machine-learning scikit-learn pipeline imblearn

asked Jan 03 '22 at 14:17

DrMaga

votes

0 answers

Nested pipelines in Imbalanced-Learn

This minimal code works fine for scikit-learn pipeline : inner_pipe = Pipeline(steps=[('scaler', StandardScaler())]) my_pipeline = Pipeline( steps=[('pre_step', inner_pipe), ('rfc', RandomForestClassifier())]) but if the used…

python scikit-learn nested pipeline imblearn

asked Sep 01 '21 at 01:24

abdelgha4

votes

0 answers

Balanced batch generator returns inconsistent class number

I am using imblearn.keras.balanced_batch_generator in my CNN classification task. But the generator produces inconsistent classes for my data (I have 12 classes in total but it produces 10/11/12 classes when batches are yielded). This is causing an…

python deep-learning imbalanced-data imblearn

asked Aug 25 '21 at 04:29

Lasven Loke
