Questions tagged [imblearn]

Python imbalanced-learning package. To improve the results or the speed of machine-learning algorithms on datasets where one or more classes have significantly fewer (or more) training examples than the others, you can use an imbalanced-learning approach. Imbalanced-learning methods use re-sampling techniques such as SMOTE, ADASYN, Tomek links, and their various combinations.

imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

Most classification algorithms will only perform optimally when the number of samples in each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more other classes, have proven to be a challenge while at the same time becoming more and more common.

One way of addressing this issue is to re-sample the dataset so as to offset the imbalance, in the hope of arriving at a more robust and fair decision boundary than you would otherwise.

Re-sampling techniques are divided into four categories:

    Under-sampling the majority class(es).
    Over-sampling the minority class.
    Combining over- and under-sampling.
    Creating ensembles of balanced sets.

Below is a list of the methods currently implemented in this module; a short usage sketch follows the list.

Under-sampling

  1. Random majority under-sampling with replacement
  2. Extraction of majority-minority Tomek links
  3. Under-sampling with Cluster Centroids
  4. NearMiss-(1 & 2 & 3)
  5. Condensed Nearest Neighbour
  6. One-Sided Selection
  7. Neighbourhood Cleaning Rule
  8. Edited Nearest Neighbours
  9. Instance Hardness Threshold
  10. Repeated Edited Nearest Neighbours
  11. AllKNN

Over-sampling

  12. Random minority over-sampling with replacement
  13. SMOTE - Synthetic Minority Over-sampling Technique
  14. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2
  15. SVM SMOTE - Support Vectors SMOTE
  16. ADASYN - Adaptive synthetic sampling approach for imbalanced learning

  17. Over-sampling followed by under-sampling

    • SMOTE + Tomek links
    • SMOTE + ENN
  18. Ensemble classifier using samplers internally

    • EasyEnsemble
    • BalanceCascade
    • Balanced Random Forest
    • Balanced Bagging
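
All of the samplers listed above share the same basic API. A minimal usage sketch, assuming a synthetic dataset and a recent imbalanced-learn release in which samplers expose fit_resample:

    # Minimal sketch; the dataset is synthetic and purely illustrative.
    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler

    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
    print("original:", Counter(y))

    # Over-sample the minority class with SMOTE...
    X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
    print("after SMOTE:", Counter(y_over))

    # ...or under-sample the majority class instead.
    X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
    print("after random under-sampling:", Counter(y_under))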

Resources:

  • GitHub repository: https://github.com/scikit-learn-contrib/imbalanced-learn
  • Documentation: https://imbalanced-learn.org

205 questions
6 votes, 1 answer

python imblearn make_pipeline TypeError: Last step of Pipeline should implement fit

I am trying to implement SMOTE of imblearn inside the Pipeline. My data sets are text data stored in pandas dataframe. Please see below the code snippet text_clf =Pipeline([('vect', TfidfVectorizer()),('scale',…
pythondumb
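
This error typically appears when a sampler such as SMOTE is placed inside scikit-learn's own Pipeline. A hedged sketch of the usual remedy, using imblearn's make_pipeline, which does accept samplers; the classifier and the train_docs/train_labels names are illustrative assumptions:

    # Samplers are only supported by imblearn's Pipeline / make_pipeline,
    # not by sklearn.pipeline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline

    text_clf = make_pipeline(
        TfidfVectorizer(),
        SMOTE(random_state=0),               # resampling is applied to the training folds only
        LogisticRegression(max_iter=1000),
    )
    # text_clf.fit(train_docs, train_labels)  # train_docs / train_labels are placeholders
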
5 votes, 1 answer

RandomUnderSampler' object has no attribute 'fit_resample'

I am using RandomUnderSampler from imblearn, but I get the following error. Any ideas? Thanks from imblearn.under_sampling import RandomUnderSampler print('Initial dataset shape %s' % Counter(y.values.squeeze())) rus =…
hsbr13
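
fit_resample was introduced around imbalanced-learn 0.4; older releases only had fit_sample, so this AttributeError usually just signals an outdated installation. A minimal sketch assuming a recent version and a synthetic dataset:

    from collections import Counter

    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler

    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
    print('Initial dataset shape %s' % Counter(y))

    rus = RandomUnderSampler(random_state=0)
    X_res, y_res = rus.fit_resample(X, y)
    print('Resampled dataset shape %s' % Counter(y_res))
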
5 votes, 2 answers

SMOTE with missing values

I am trying to use SMOTE from imblearn package in Python, but my data has a lot of missing values and I got the following error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). I checked the parameters here, and…
MJeremy
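
SMOTE itself rejects NaN, so a common pattern is to impute before resampling. A hedged sketch; SimpleImputer with the median strategy is an illustrative choice, not the only sensible one:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.impute import SimpleImputer
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
    X[::20, 0] = np.nan                          # inject some missing values

    # Impute first, then resample: SMOTE cannot interpolate between rows with NaN.
    X_imputed = SimpleImputer(strategy="median").fit_transform(X)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X_imputed, y)
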
4 votes, 1 answer

TypeError: __init__() got an unexpected keyword argument 'ratio' when using SMOTE

I am using SMOTE to oversample as my dataset is imbalanced. I am getting an unexpected argument error. But in the documentation, the ratio argument is defined for SMOTE. Can someone help me understand where I am going wrong? Code snippet from…
anushiya-thevapalan
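
The ratio argument was deprecated and later removed; recent imbalanced-learn releases take sampling_strategy instead. A minimal sketch on synthetic data (the 0.5 value is illustrative):

    # sampling_strategy replaces the removed ratio keyword; it accepts a float
    # (binary problems), a string such as "minority", or a dict.
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

    sm = SMOTE(sampling_strategy=0.5, random_state=0)  # desired minority/majority ratio
    X_res, y_res = sm.fit_resample(X, y)
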
4 votes, 1 answer

How to implement RandomUnderSampler in a scikit-learn pipeline?

I have a scikit learn pipeline to scale numeric features and encode categorical features. It was working fine until I tried to implement the RandomUnderSampler from imblearn. My goal is to implement the undersampler step since my dataset is very…
Ale M.
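
A hedged sketch of one way to wire this up: preprocessing goes into a ColumnTransformer and everything is wrapped in imblearn's Pipeline, since scikit-learn's Pipeline refuses sampler steps. The column names and toy data are assumptions for illustration:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from imblearn.pipeline import Pipeline
    from imblearn.under_sampling import RandomUnderSampler

    df = pd.DataFrame({
        "age": [23, 45, 31, 52, 40, 29, 60, 35],
        "city": ["a", "b", "a", "c", "b", "a", "c", "b"],
        "target": [0, 0, 0, 0, 0, 0, 1, 1],
    })
    X, y = df[["age", "city"]], df["target"]

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])

    clf = Pipeline([
        ("prep", preprocess),
        ("under", RandomUnderSampler(random_state=0)),  # runs on training data only
        ("model", LogisticRegression()),
    ])
    clf.fit(X, y)
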
4 votes, 3 answers

When to do feature selection in an imblearn pipeline with cross-validation and grid search

Currently I am building a classifier with heavily imbalanced data. I am using the imblearn pipeline to first to StandardScaling, SMOTE, and then the classification with gridSearchCV. This ensures that the upsampling is done during the…
Joost Jansen
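
The usual recommendation is to keep the selector inside the imblearn Pipeline, so that scaling, SMOTE and feature selection are all re-fit on each training fold and nothing leaks from the held-out fold. A minimal sketch; the classifier and the parameter grid are illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1],
                               random_state=0)

    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("smote", SMOTE(random_state=0)),
        ("select", SelectKBest(f_classif)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    grid = GridSearchCV(pipe, {"select__k": [5, 10], "clf__C": [0.1, 1.0]},
                        scoring="f1", cv=5)
    grid.fit(X, y)
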
4 votes, 1 answer

How to oversample image dataset using Python?

I am working on a multiclass classification problem with an unbalanced dataset of images(different class). I tried imblearn library, but it is not working on the image dataset. I have a dataset of images belonging to 3 class namely A,B,C. A has 1000…
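
imblearn samplers expect a 2-D (n_samples, n_features) matrix, which is why they fail on raw image arrays. A hedged sketch of the flatten, resample, reshape workaround; the shapes below are assumptions, and for images class-aware augmentation is often a better fit than SMOTE on raw pixels:

    import numpy as np
    from imblearn.over_sampling import RandomOverSampler

    # Toy stand-in for an image dataset: 60 RGB images of 32x32 pixels with
    # an imbalanced three-class label vector (classes A/B/C encoded as 0/1/2).
    images = np.random.rand(60, 32, 32, 3)
    labels = np.array([0] * 40 + [1] * 15 + [2] * 5)

    flat = images.reshape(len(images), -1)                  # (60, 3072)
    X_res, y_res = RandomOverSampler(random_state=0).fit_resample(flat, labels)
    images_res = X_res.reshape(-1, 32, 32, 3)               # back to image shape
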
4 votes, 3 answers

Feature Importance using Imbalanced-learn library

The imblearn library is a library used for unbalanced classifications. It allows you to use scikit-learn estimators while balancing the classes using a variety of methods, from undersampling to oversampling to ensembles. My question is however, how…
mamafoku
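
Resampling does not change how importances are read: fit the imblearn Pipeline and inspect the final estimator through named_steps. A minimal sketch with an illustrative random forest on synthetic data:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    X, y = make_classification(n_samples=500, n_features=10, weights=[0.9, 0.1],
                               random_state=0)

    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("rf", RandomForestClassifier(random_state=0))])
    pipe.fit(X, y)

    # The forest was trained on the resampled data; read its importances directly.
    print(pipe.named_steps["rf"].feature_importances_)
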
4 votes, 2 answers

How can I generate categorical synthetic samples with imblearn and SMOTE?

I am looking to generate synthetic samples for a machine learning algorithm using imblearn's SMOTE. I have a few categorical features which I have converted to integers using sklearn preprocessing.LabelEncoder. The problem that I have is that when…
S Hoult
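
Plain SMOTE interpolates label-encoded categories into meaningless in-between values; SMOTENC is the variant intended for mixed categorical/continuous data and takes the indices of the categorical columns. A minimal sketch on toy data:

    import numpy as np
    from imblearn.over_sampling import SMOTENC

    # Column 0 is continuous, column 1 is a label-encoded categorical feature.
    X = np.column_stack([
        np.random.rand(100),
        np.random.randint(0, 3, size=100),
    ])
    y = np.array([0] * 90 + [1] * 10)

    sm = SMOTENC(categorical_features=[1], random_state=0)
    X_res, y_res = sm.fit_resample(X, y)
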
3 votes, 4 answers

How to resolve "cannot import name '_MissingValues' from 'sklearn.utils._param_validation'" issue when trying to import imblearn?

I am trying to import imblearn into my python notebook after installing the required modules. However, I am getting the following error: Additional info: I am using a virtual environment in Visual Studio Code. I've made sure that venv was selected…
user22158562
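
This import error usually means the installed imbalanced-learn release targets a different scikit-learn version than the one in the environment (the private _MissingValues helper moved between scikit-learn releases). A small sketch for checking the installed versions without triggering the failing import; the fix is typically to upgrade both packages together inside the active venv:

    # Check versions without importing imblearn (which is what fails here).
    from importlib.metadata import version

    print("scikit-learn:", version("scikit-learn"))
    print("imbalanced-learn:", version("imbalanced-learn"))
    # e.g. pip install -U scikit-learn imbalanced-learn
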
3 votes, 1 answer

Why does SMOTE not work with more than 15 features / What method does work with more than 15 features?

I'm currently implementing machine learning using SMOTE from imblearn.over_sampling, and as I'm synthesizing data for it, I see a very noticeable cutoff for when the SMOTE method breaks. When I synthesize data using the following code and run it…
3 votes, 7 answers

Cannot import name 'available_if' from 'sklearn.utils.metaestimators'

While importing "from imblearn.over_sampling import SMOTE", getting import error. Please check and help. I tried upgrading sklearn, but the upgrade was undone with 'OSError'. Firsty installed imbalance-learn through pip. !pip install -U…
Piyush
3 votes, 1 answer

The difference between smote.fit_sample() and smote.fit_resample()

In imblearn, what is the difference between smote.fit_sample() and smote.fit_resample(), and when should we use one over the other?
Manish KC
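
They do the same thing: fit_sample was the original name, kept for a while as a deprecated alias of fit_resample and later removed, so fit_resample is the one to use on current imbalanced-learn releases. A minimal sketch:

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)   # current API
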
3 votes, 1 answer

'RandomOverSampler' object has no attribute '_validate_data'

Hi I am getting following error can anyone suggest me what could be wrong? When I am calling, os.fit_sample(X,y) 'RandomOverSampler' object has no attribute '_validate_data'
Vikas Singh
3 votes, 0 answers

How to save model after sklearn gridsearchcv using imblearn pipeline : TypeError: can't pickle _thread.RLock objects

The problem i am facing is this that I have performed grid search using imblearn pipeline and using sklearn gridsearchcv as I was dealing with an extremely unbalanced dataset, but when I try to save the model , I am getting the error 'TypeError:…
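
One common workaround, sketched below on toy data, is to persist only the fitted best_estimator_ with joblib rather than the whole GridSearchCV object; this assumes the pipeline steps themselves are picklable, since unpicklable attributes (loggers, open handles, wrapped Keras models) are a frequent source of the RLock error. All names and data here are illustrative:

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    pipe = Pipeline([("smote", SMOTE(random_state=0)),
                     ("clf", LogisticRegression(max_iter=1000))])
    grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, scoring="f1", cv=3)
    grid.fit(X, y)

    # Persist just the fitted pipeline, not the surrounding GridSearchCV object.
    joblib.dump(grid.best_estimator_, "best_model.joblib")
    model = joblib.load("best_model.joblib")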