Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.

351 questions
0 votes, 0 answers

Which model to choose based on Precision and Recall values for imbalanced classes

I am working on a wine quality dataset to predict whether the quality of a wine is good or bad. I have used multiple classification models and calculated their accuracy, precision, and recall scores, as shown below. However, I cannot rely on the accuracy score as…
pankaj mishra
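
One way to see why accuracy alone can mislead on imbalanced classes is to compute precision, recall, and F1 next to it. The sketch below is only an illustration: the labels are made up (90 "bad" wines labelled 0, 10 "good" wines labelled 1), the always-"bad" model is hypothetical, and `precision_recall_f1` is a helper name chosen here, not something from the question.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical data: 90 "bad" wines (0) and 10 "good" wines (1).
y_true = [0] * 90 + [1] * 10
# Hypothetical model that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(accuracy, precision, recall, f1)
```

With these made-up labels the accuracy looks high while precision, recall, and F1 for the "good" class are all zero, so comparing models on the minority-class metrics is usually more informative than comparing them on accuracy.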
0 votes, 0 answers

Multi class text classification when having only one sample for classes

I have a dataset of texts, each identified by an ID number. I would like to predict the best-matching ID number for new incoming texts. I am not sure whether multi-class text classification is the right approach, since…
0 votes, 0 answers

Neural Network does not generalize highly imbalanced data

I am fairly new to machine learning and am currently trying to build a simple feedforward neural network on severely imbalanced data. The data consists of 64 different variables (all normalized) and 1 binary variable (1 and 0) which the nn is…
0 votes, 0 answers

SMOTE for regression on unbalanced features

I am working on a regression model with numerical features and target.
  • y: the weight of waste collected in recycling bins
  • Xi: features about demography or urban elements around the bin, or the appearance of the bin
I noticed that my features that seem to…
Elise1369
0 votes, 2 answers

How can I know which is the positive class value and negative class value for XGBoost?

I am working with an imbalanced dataset where I have a class variable of 2 different values: 0 and 1. The number of '0' values is 1000 and the number of '1' values is 3000. For XGBClassifier, LGBMClassifier and CatBoostClassifier I found that there…
jartymcfly
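
As far as I know, XGBoost's binary logistic objective treats label 1 as the positive class, and its documentation suggests the ratio of negative to positive instances as a typical value for `scale_pos_weight`. The sketch below only does that arithmetic with the counts quoted in the question (1000 zeros, 3000 ones); the `labels` list is a stand-in for the real target column, and whether the same convention holds for the other libraries mentioned should be checked against their own documentation.

```python
from collections import Counter

# Stand-in for the real target column: 1000 zeros and 3000 ones, as in the question.
labels = [0] * 1000 + [1] * 3000

counts = Counter(labels)
n_negative = counts[0]  # label 0 is assumed to be the negative class
n_positive = counts[1]  # label 1 is assumed to be the positive class

# Ratio of negative to positive instances.
scale_pos_weight = n_negative / n_positive
print(n_negative, n_positive, scale_pos_weight)

# The value could then be passed on, for example:
#   XGBClassifier(scale_pos_weight=scale_pos_weight)
```

Note that with these counts the class labelled 1 is the majority, so the ratio is below 1 and would down-weight the positive class rather than up-weight it.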
0 votes, 0 answers

SMOTE function in R

Does anyone know how to set perc.over and perc.under in my case? I tried a couple of combinations, but they did not give me good results. I want my target variable to be split almost 50/50. I have 266776 for my training set, and the current…
Gracetam
0 votes, 1 answer

The definition of an unbalanced sample

As we know, unbalanced samples cause issues and require more effort. While handling this issue, I became confused about the definition. Say I have a training dataset of 200 cats, 200 dogs and 400 stones. When I am to classify the dataset, when classifying 3…
Grec001
0 votes, 3 answers

How to keep/extend index when oversampling

I've got a dataframe like this, and I want to oversample the column "role" (in the real case the number of rows/columns is much bigger than in this minimal example) role value pop_13vdpn1_site_1 1 1 pop_13vdpn1_site_1 1 …
psagrera
0 votes, 1 answer

How to set a class_weight Dictionary for Random Forest?

I'm dealing with an unbalanced dataset, so I decided to use a weight dictionary for classification. The documentation says that a weight dict must be defined as shown…
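
For reference, scikit-learn's `RandomForestClassifier` accepts `class_weight` as a dictionary of the form `{class_label: weight}`. A minimal sketch of building one follows, assuming the "balanced" heuristic `n_samples / (n_classes * count)`; the label list `y` is made up (900 zeros, 100 ones) and is not the asker's data.

```python
from collections import Counter

# Made-up labels: 900 samples of class 0 and 100 samples of class 1.
y = [0] * 900 + [1] * 100

counts = Counter(y)
n_samples = len(y)
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * count_of_class),
# so rarer classes receive proportionally larger weights.
class_weight = {label: n_samples / (n_classes * count)
                for label, count in counts.items()}
print(class_weight)

# The dictionary could then be passed on, for example:
#   RandomForestClassifier(class_weight=class_weight)
```

Any other weighting scheme works the same way, as long as every class label present in the target appears as a key in the dictionary.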
0 votes, 1 answer

Class weights on imbalanced CNN

I am trying to implement a simple CNN classification on a set of x-ray images belonging to 4 classes. The dataset looks like this: img A B C D 1 [[[0, 0, 0], [0, 0, 0], [0, 0, 0],…
0 votes, 1 answer

Cost Sensitive Classifier fails for heavily imbalanced datasets

I am going to try to keep this as specific as possible, but it is kind of a general question as well. I have a heavily skewed dataset on the order of { 'Class 0': 0.987, 'Class 1':0.012 }. I would like to have a set of classifiers that work well on…
0 votes, 1 answer

How to get better precision and recall with imbalanced dataset in Python

I am working on a Medicare fraud detection model. The data is very imbalanced, with 14 fraudulent positive cases and approximately 1 million non-fraudulent cases. I initially had 8 features, but with one-hot encoding of my categorical variables,…
0 votes, 0 answers

Binary Classification on Unbalanced Medical Datasets

I want to work on a binary classification problem using a medical dataset consisting of 35K images. I have a few questions about it. 1.) Architectures like VGG, Inception, etc., which are typically used on datasets like COCO, ImageNet, etc., are…
0 votes, 0 answers

Duplicating samples of time series

I have a highly imbalanced dataset:

    from collections import Counter
    unique1, counts1 = np.unique(labels_ds, return_counts=True)
    dict(zip(unique1, counts1))
    print('Original dataset shape {}'.format(counts1))
    #returns
    #Original dataset shape…
0 votes, 0 answers

scikit-learn: what to do when the data I want to predict has a different distribution from the data I have right now

To be specific, I am working with a dataset of 100,000 rows and 20 features. My target variable is categorical, so I use a random forest classifier, XGBoost, LogisticRegression, etc. I have a binary feature 'A', which in my dataframe only 20% of…