Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
0 votes, 1 answer

The reason of different results of KNN algorithm from PYOD & Sklearn packages

Besides this post, I experimented with KNN algorithms from the sklearn and PyOD packages, using an unsupervised approach on a benchmark dataset for an anomaly detection task, and I get different…
Mario
  • 1,631
  • 2
  • 21
  • 51
0 votes, 0 answers

binary search tree_ how to update and calculate the imbalance_python

I am building a binary search tree, and I want to update the imbalance when I add a child, using this function inside the add_child function. But I have run into a problem; can someone tell me what is wrong? Thank you very much! And is it correct to…
0 votes, 1 answer

Get error: unexpected keyword argument 'random_state' when using TomekLinks

My code is:

    undersample = TomekLinks(sampling_strategy='majority', n_jobs=-1, random_state=42)
    X_tl, y_tl = undersample.fit_resample(X, y)

When I run it, I get this error: TypeError: __init__() got an unexpected keyword argument…
Amit S
  • 225
  • 6
  • 16
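As far as I know, recent imbalanced-learn releases define `TomekLinks` with `sampling_strategy` and `n_jobs` but no `random_state`, which would explain the TypeError above. The procedure is deterministic, so there is nothing to seed. The sketch below is not the library's implementation; it is a minimal pure-Python illustration with hypothetical helper names (`nearest_neighbor`, `tomek_links`) and made-up toy data.

```python
import math

def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )

def tomek_links(points, labels):
    """Pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_neighbor(points, i)
        if i < j and labels[i] != labels[j] and nearest_neighbor(points, j) == i:
            links.append((i, j))
    return links

# Toy data (illustrative only): class 0 is the majority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 1]

links = tomek_links(points, labels)

# Undersample the majority class: drop the class-0 member of every link.
majority = 0
to_drop = {k for pair in links for k in pair if labels[k] == majority}
kept = [k for k in range(len(points)) if k not in to_drop]
print(links, kept)
```

No random choice is made anywhere in the procedure. With imbalanced-learn itself, removing the `random_state` argument from the constructor call should be enough, assuming a recent release.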
0 votes, 1 answer

Warning Message in binary classification model Gaussian Naive Bayes?

I am using a multiclass classification-ready dataset with 14 continuous variables and classes from 1 to 10. This is the data file: https://drive.google.com/file/d/1nPrE7UYR8fbTxWSuqKPJmJOYG3CGN5y9/view?usp=sharing My goal is to apply the…
0 votes, 1 answer

matplotlib: histogram of SMOTEd class distribution showing colored synthetic region

Say I have a binary imbalanced dataset like so:

    from collections import Counter
    from sklearn.datasets import make_classification
    from matplotlib import pyplot as plt
    from imblearn.over_sampling import SMOTE

    # fake dataset
    X, y =…
user12587364
0 votes, 0 answers

Predictions stuck at zero when positive label (1) is only 16% of data

So, when I run the same code with a 50/50 split of 0 and 1 labels, I get about 70% accuracy on the val set and my val preds are not stuck at 0. However, when I run the code on a dataset with an 84/16% split of labels 0 and 1, all my val preds end up being 0. I…
Mona Jalal
  • 34,860
  • 64
  • 239
  • 408
0 votes, 0 answers

Multiclass Sampling Strategy

Scenario: I am currently working on a multiclass classification problem. I have a historical dataset of 2 million records with 180 classes and need to create a model that will predict the classes accurately. I have created a model using HybridGradientboosting…
0 votes, 4 answers

How to handle imbalanced data in general

I have been working on a case study where the data is highly imbalanced. We have been taught that we can handle imbalanced data by either undersampling the majority class or oversampling the minority class. I wanted to ask if there is any other…
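The two resampling options mentioned in this question, plus one common alternative (class weights), can be sketched in plain Python. This is an illustrative sketch, not a recommendation: the function names and the toy data are hypothetical, and the weight formula `n_samples / (n_classes * count)` is the usual "balanced" heuristic.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - count):
            k = rng.choice(idx)
            X_out.append(X[k])
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

def balanced_class_weights(y):
    """Weight each class inversely to its frequency instead of resampling."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * c) for label, c in counts.items()}

# Toy data (illustrative only): 8 negatives, 2 positives.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
weights = balanced_class_weights(y)
print(Counter(y_over), Counter(y_under), weights)
```

Whichever variant is used, resampling should be applied to the training split only, so that validation and test data keep the original class distribution.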
0 votes, 0 answers

samples with almost identical features but different classes and poor classification performance (recall and precision)

I have 77,000 text samples, of which 4,900 are positive and about 72,000 are negative (binary classification), and the maximum length of these samples is 15 (the samples are sentences). Not only are the data imbalanced, but also positive…
soheila
  • 15
  • 1
  • 5
0 votes, 0 answers

True Negatives have better prediction than True Positives

I have applied Logistic Regression to data containing both binary and numerical predictors with a binary target. In the confusion matrix of the results, True Negatives (65%) are the largest share, followed by False Positives (>20%), both higher than True Positives (8%). I need…
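To see what such a confusion matrix implies, it can help to turn the cell shares into per-class metrics. The sketch below uses assumed values: the excerpt only gives TN = 65%, FP > 20% and TP = 8%, so FP = 20 and FN = 7 (the remainder out of 100) are hypothetical numbers chosen for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard metrics derived from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),      # of predicted positives, how many are right
        "recall": tp / (tp + fn),         # of actual positives, how many are found
        "specificity": tn / (tn + fp),    # of actual negatives, how many are found
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Assumed counts per 100 samples (FP and FN are illustrative guesses).
metrics = confusion_metrics(tp=8, fp=20, tn=65, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed numbers, accuracy looks acceptable because the negatives dominate, while precision on the positive class is low. That gap is the usual reason to report precision and recall alongside accuracy on imbalanced data.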
0 votes, 1 answer

How can I resolve imbalanced datasets for AutoML classification on GCP?

I am planning to use AutoML for the classification of my tabular data. But there is a moderate imbalance in the target variable. When running my own model, I would either upsample, downsample or build synthetic samples to resolve the imbalance. Is…
0 votes, 0 answers

AUROC for imbalanced dataset

This is my first question here and I hope you can help me. At the moment I'm training a binary classifier for medical images, and my dataset is imbalanced with a ratio of roughly 0.8 (negative) to 0.2 (positive). My code is written with PyTorch and…
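For a 0.8/0.2 split like the one described, a small sketch can show why AUROC is often preferred over accuracy. It computes AUROC as the probability that a random positive is scored above a random negative (ties count as one half). The function name and the scores are hypothetical, and the pairwise loop is only meant for small examples.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (illustrative only): 8 negatives, 2 positives.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.65, 0.9]

model_auc = auroc(labels, scores)

# A degenerate model that gives every sample the same score.
constant_auc = auroc(labels, [0.0] * len(labels))

# Accuracy of always predicting the negative class.
always_negative_acc = sum(1 for lab in labels if lab == 0) / len(labels)
print(model_auc, constant_auc, always_negative_acc)
```

The constant model reaches 80% accuracy on this split but only 0.5 AUROC, so AUROC does not reward simply predicting the majority class.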
0 votes, 0 answers

Does model underfitting based-on Accuracy matter for imbalanced data?

I am training a deep learning model on imbalanced data for binary classification. I used binary_crossentropy as the loss function and Accuracy as the metric. When I plotted the loss, the model appeared to be underfitting. Is that a problem as my data is…
0 votes, 1 answer

Should we actively use the weight argument in loss functions

Most current machine learning libraries have loss functions that come with a weight argument, which allows us to tackle unbalanced datasets. However, should this feature be actively made use of? If not, are there certain guidelines as to when…
tangolin
  • 434
  • 5
  • 15
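As a rough illustration of what such a weight argument does, the sketch below implements a positively weighted binary cross-entropy by hand. It is not any particular library's implementation; the function name, the `pos_weight` parameter name, and the toy predictions are assumptions made for the example.

```python
import math

def weighted_bce(y_true, p_pred, pos_weight=1.0, eps=1e-12):
    """Mean binary cross-entropy where errors on positives are scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Toy data (illustrative only): 1 positive, 4 negatives,
# and a model that leans towards "negative" for every sample.
y_true = [1, 0, 0, 0, 0]
p_pred = [0.1, 0.1, 0.1, 0.1, 0.1]

unweighted = weighted_bce(y_true, p_pred)
# One common heuristic: pos_weight = (number of negatives) / (number of positives).
weighted = weighted_bce(y_true, p_pred, pos_weight=4.0)
print(unweighted, weighted)
```

With the weight applied, missing the single positive costs as much as the four negatives combined, so a model that ignores the minority class is penalised more heavily. Whether that trade-off is desirable depends on the relative cost of false negatives and false positives in the application.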
0 votes, 1 answer

Binary data however oversampler states it is multilabeled

I am using Kaggle's Twitter dataset and I am trying to oversample the minority class. Despite y being binary, the oversampler returns an error stating that it is multi-class. My x and y are the tweets and the labels respectively.
Randy Chng
  • 103
  • 5