Questions tagged [imbalanced-data]

Problem definition

Imbalanced data occurs when:

  • "The user assigns more importance to the predictive performance... on a subset of the target variable domain."
  • "[T]he cases that are more relevant for the user are poorly represented in the training set."

Paula Branco, Luís Torgo, and Rita P. Ribeiro. (2016) A Survey of Predictive Modeling on Imbalanced Domains. ACM Computing Surveys, Volume 49, Issue 2.
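
As a rough illustration of the second condition (the relevant cases being poorly represented), the sketch below counts class frequencies and reports the majority-to-minority ratio. The helper name `imbalance_ratio` and the toy labels are made up for this example and are not taken from the survey.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (per-class counts, majority count / minority count)."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


# Hypothetical toy labels: 90 negatives and 10 positives.
labels = ["neg"] * 90 + ["pos"] * 10
counts, ratio = imbalance_ratio(labels)
print(dict(counts), ratio)
```

A ratio close to 1 suggests roughly balanced classes; how large a ratio counts as "imbalanced" depends on the problem and on which class the user cares about.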

351 questions
0 votes, 1 answer

How to handle imbalanced dataset for CheXpert data on a classification problem from radiography images

I am working on an image classification problem using CNN and DNNs to be more specific. But the data at hand is highly imbalanced and hence giving highly skewed results. It is predicting everything as true or everything as false. I have tried the…
0 votes, 1 answer

Problem with Over- and Under-Sampling with ROSE in R

I have a dataset to classify between won cases (14399) and lost cases (8677). The dataset has 912 predicting variables. I am trying to oversample the lost cases in order to reach almost the same number as the won cases (so having 14399 cases for…
user8383689
0 votes, 1 answer

Imbalance in multi class classification problem - four target levels

My data is imbalanced, as shown below. Whenever I try ADASYN, it throws an error. Do we need to supply any parameters for it? Sometimes it runs for a long time with no response, even after 40 minutes. …
0 votes, 0 answers

Error: "missing values are not allowed in subscripted assignments of data frames"

I am new to R and I am writing R code for a personal project/exercise. The data I am using comes from a survey on the ethnic identity of people from Hong Kong. I used 2019 data from http://data.hkupop.hku.hk/v3/hkupop/ethnic_identity/ch.html. After…
0 votes, 2 answers

How to deal with a data imbalance problem in Rasa NLU?

I have 12 intents to identify, but the amount of data per intent is inconsistent. Intents like meeting settings and reminders have thousands of examples, but for intents like greetings and thank-yous there are very few…
shaojie
0 votes, 1 answer

For an imbalanced dataset, is it better to use oversampling or undersampling techniques?

I have a binary classification problem where the dataset is imbalanced, and I don't know whether to use undersampling or oversampling.
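
To make the two options concrete, here is a minimal standard-library sketch of random oversampling (duplicating minority rows with replacement) and random undersampling (discarding majority rows). The function names and the toy data are hypothetical; this only shows the mechanics and is not a recommendation for either approach.

```python
import random
from collections import Counter, defaultdict


def _group_by_label(rows, labels):
    """Collect rows into a dict keyed by class label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Duplicate rows (with replacement) until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = rng.choices(group, k=target - len(group))
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset (without replacement) so every class matches the smallest."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical toy data: 90 rows of class 0 and 10 rows of class 1.
rows = list(range(100))
labels = [0] * 90 + [1] * 10
print(Counter(random_oversample(rows, labels)[1]))
print(Counter(random_undersample(rows, labels)[1]))
```

Oversampling keeps every original row but repeats minority examples, while undersampling throws away majority data. Either way, resampling is normally applied to the training split only, so that the evaluation data keeps its original class distribution.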
0 votes, 0 answers

Is there a more efficient way to oversample data than random.sample()?

I have a large imbalanced classification problem and want to address it by oversampling the minority classes (N(class 1) = 8.5 million, N(class n) = 3,000). To do so, I want to get 100,000 samples for each of the n classes by data_oversampled =…
Quastiat
0 votes, 2 answers

Passing a list as loss_weights, it should have one entry per model output. Keras tells me that the model has 1 output but I thought having more

I have a dataset df for a multiclass classification problem. I have a huge class imbalance. Namely, grade_F and grade_G. >>> percentage = 1. / df['grade'].value_counts(normalize=True) >>> print(percentage ) B 0.295436 C 0.295362 A …
0 votes, 1 answer

Deep Learning with Small Datasets and SMOTE

I have a dataset with 6,000 records, split 60-20-20 into training, validation, and test sets. I get an accuracy of around 76% with XGBoost. I converted my data into a time series and applied LSTM/1-D ConvNets, and the accuracy is around 60%. Is my…
0 votes, 1 answer

Why do we use the loss to update our model but use the metrics to choose the model we need?

First of all, I am confused about why we use the loss to update the model but use the metrics to choose the model we need. Not all code does this, but most of the code I've seen uses EarlyStopping to monitor the metrics on the validation…
JALS
0 votes, 2 answers

Multi-features modeling based on one binary-feature which is rarely 1 (imbalanced data) when there is a cost

I need to model multivariate time-series data to predict a binary target that is rarely 1 (imbalanced data). Does this mean that we want to model based on one binary feature (outbreak) that is rarely 1? All of the features are binary and rarely 1. What…
0 votes, 1 answer

How to take a more balanced data sample in Python

I have a dataframe with normalized percentage info. Eg. wordCount number Percent 2.0 1282 0.267345 1.0 888 0.185213 3.0 1124 0.170791 4.0 1250 0.152877 5.0 554 0.084864 6.0 333 0.058904 7.0 …
Jennifer
0 votes, 1 answer

Retrieve the indices for only the resampled instances after oversampling using imbalanced-learn?

For a binary text classification problem with imbalanced data, I use the imbalanced-learn library's RandomOverSampler to balance the classes. Now, I want to retrieve only the instances that were oversampled (replicated) from the original data.…
PinkBanter
0 votes, 2 answers

NaNs with customised weighted F1-Score in Keras

I need to compute a weighted F1-score that penalizes errors on my least frequent label more heavily (a typical binary classification problem with an unbalanced dataset). Unfortunately, I don't get a valid F1-score. The following are my metrics…
Guido
-1 votes, 0 answers

“DataConversionWarning” when Training Logistic Regression Model with Unbalanced Data

X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) model = LogisticRegression() model.fit(X_train,…