Questions tagged [oversampling]

Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).
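
As a minimal sketch of the two techniques described above, the standard-library Python below rebalances a label list to a fixed per-class size. The function name `rebalance`, the `target` parameter, and the 88:12 toy labels are illustrative assumptions, not part of any library mentioned on this page.

```python
import random
from collections import defaultdict

def rebalance(labels, target, seed=0):
    """Return row indices so that every class ends up with `target` rows.

    Classes with at least `target` rows are undersampled (drawn without
    replacement); smaller classes are oversampled (all originals kept,
    extras drawn with replacement).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    chosen = []
    for idx in by_class.values():
        if len(idx) >= target:
            chosen.extend(rng.sample(idx, target))                 # undersample
        else:
            chosen.extend(idx)                                     # keep originals
            chosen.extend(rng.choices(idx, k=target - len(idx)))   # oversample
    return chosen

labels = ["no"] * 88 + ["yes"] * 12          # hypothetical imbalanced labels
balanced = rebalance(labels, target=50)
counts = {y: sum(labels[i] == y for i in balanced) for y in set(labels)}
```

Working with indices rather than rows keeps the sketch independent of how the features are stored (lists, arrays, image files).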

156 questions
0
votes
1 answer

Creating R's formula using Python

I am writing a program that interacts with R using Python. Basically, I have some R libraries that I want to ingest into my Python code. After downloading rpy2, I define my R functions that I want to use in a separate .R file script. The R function…
Perl Del Rey
0
votes
1 answer

Undersampling with image data in python

The main idea of undersampling is to randomly delete observations from the class that has sufficient observations, so that the comparative ratio of the two classes is significant in our data. So, how can I undersample image data in Python? Please help me. I took the…
hilyap
0
votes
2 answers

SMOTE-NC in R. No packages found

I have a dataset with 5 nominal and 37 categorical variables. I want to perform oversampling in R. However, with SMOTE, I cannot do so. I looked for SMOTE-NC as advised by (Chawla, Bowyer and Hall, 2002), but I could not find any package supporting…
0
votes
1 answer

Upsampling tweets using SMOTE

I have an imbalanced dataset of tweets labeled as -1, 0, +1. I want to balance the classes by upsampling. I receive the following error: tweet_train=tweet_train.reshape(-1, 1) X_train_upsample, y_train_upsample =…
Vahid the Great
0
votes
1 answer

Problem with Over- and Under-Sampling with ROSE in R

I have a dataset to classify between won cases (14399) and lost cases (8677). The dataset has 912 predicting variables. I am trying to oversample the lost cases in order to reach almost the same number as the won cases (so having 14399 cases for…
user8383689
0
votes
1 answer

Upsampling with 64 hz in R

I have data in the format below (sample data pasted here). It has three variables: start time, end time, and a set of values between these timestamps. The sampling rate is 64 Hz. Now I need output in the following format with difference between two…
Coolsun
0
votes
0 answers

ValueError: could not convert string to float SMOTE fit_sample Python Oversampling

I have a credit risk analysis dataset which goes like this:

Loan_ID  Age  Income(LPA)  Employed_yr  Education  Loan_status
1        18   2.4          1            12th       1
2        46   43           26           …
noob
0
votes
0 answers

Is there a more efficient way to oversample data than random.sample()?

I have a big unbalanced classification problem and want to address it by oversampling the minority classes (N(class 1) = 8.5 million, N(class n) = 3,000). For that purpose I want to get 100,000 samples for each of the n classes by data_oversampled =…
Quastiat
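
One possible approach to the question above, sketched with the standard library only: `random.sample()` draws without replacement and raises `ValueError` when asked for more items than the population holds, so it cannot turn 3,000 rows into 100,000. `random.choices()` draws with replacement in a single call. The function name `oversample_per_class` and the toy class sizes are assumptions for illustration; the asker's actual code is truncated in the excerpt.

```python
import random
from collections import defaultdict

def oversample_per_class(labels, n_per_class, seed=0):
    """Map each class to `n_per_class` row indices drawn with replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    # One vectorised-style call per class instead of one draw per sample.
    return {y: rng.choices(idx, k=n_per_class) for y, idx in by_class.items()}

labels = ["a"] * 5000 + ["b"] * 30           # hypothetical class sizes
picked = oversample_per_class(labels, n_per_class=1000)
```

For a class that is already larger than the requested size, drawing with replacement still works but may repeat rows; `rng.sample` could be substituted there if duplicates are unwanted.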
0
votes
1 answer

Confused by the code for oversampling in R

The code below oversamples houses with more than 10 rooms. What does prob = ifelse(housing.df$ROOMS>10, 0.9, 0.01) mean? Thanks a lot. s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS>10, 0.9, 0.01)) housing.df[s, ]
Lea DM
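
To illustrate what such a `prob` vector does, here is a rough Python analogue. In R's `sample()`, `prob` is a vector of per-row weights that need not sum to one, so rows with more than 10 rooms get weight 0.9 and all others 0.01. The `rooms` values below are made up, and `random.choices` draws with replacement, whereas R's `sample()` defaults to drawing without replacement, so this is only an approximation of the R call.

```python
import random

rooms = [4, 6, 12, 5, 11, 7, 3, 8]           # hypothetical ROOMS column
weights = [0.9 if r > 10 else 0.01 for r in rooms]

# Normalised weights give the chance of each row on a single draw.
total = sum(weights)
probs = [w / total for w in weights]
p_large = sum(p for p, r in zip(probs, rooms) if r > 10)

# Empirical check: draw many rows and measure the share of large houses.
rng = random.Random(0)
draws = rng.choices(range(len(rooms)), weights=weights, k=1000)
share_large = sum(rooms[i] > 10 for i in draws) / len(draws)
```

With two large houses out of eight, a single draw lands on a large house with probability 1.8 / 1.86, far above their 25% share of the rows, which is the oversampling effect.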
0
votes
0 answers

Create RandomForest training without splitting the data. I have training data in one file and test data in another file

I want to try using the random forest classifier in python without using train_test_split. I have a training dataset in one file and I want to train the python machine learning model using the training dataset and then I want to apply the model on…
NikhilR
0
votes
1 answer

Retrieve the indices for only the resampled instances after oversampling using imbalanced-learn?

For a binary text classification problem with imbalanced data, I use imbalanced-learn library's function RandomOverSampler to balance the classes. Now, I want to retrieve only the instances that were oversampled (replicated) from the original data.…
PinkBanter
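
A library-independent sketch of the idea in the question above: if oversampling is done on row indices, the replicated instances are simply the indices appended after the originals. The function name `oversample_with_indices` and the toy labels are assumptions. imbalanced-learn's `RandomOverSampler` is documented to expose a `sample_indices_` attribute after fitting, which may serve the same purpose, but the code below does not rely on it.

```python
import random

def oversample_with_indices(labels, minority, seed=0):
    """Randomly oversample a binary problem and report the replicated rows.

    Returns (all_idx, extra): `all_idx` is the original indices followed by
    the replicated ones; `extra` holds only the replicated indices.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    n_majority = len(labels) - len(minority_idx)
    n_extra = max(0, n_majority - len(minority_idx))
    extra = rng.choices(minority_idx, k=n_extra)   # drawn with replacement
    all_idx = list(range(len(labels))) + extra
    return all_idx, extra

labels = ["neg"] * 8 + ["pos"] * 2           # hypothetical text labels
all_idx, extra = oversample_with_indices(labels, minority="pos")
```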
0
votes
1 answer

SMOTE in python

I am trying to use SMOTE in Python and am looking for a way to manually specify the number of minority samples. Suppose we have 100 records of one class and 10 records of another class: if we use ratio = 1 we get 100:100; if we use ratio 1/2,…
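
The arithmetic behind such a ratio can be sketched as follows, assuming the ratio means minority count divided by majority count after resampling (consistent with the 100:100 example for ratio = 1). Both helper names are hypothetical and not part of any SMOTE implementation.

```python
def synthetic_needed(n_majority, n_minority, ratio):
    """Synthetic minority samples required to reach minority/majority == ratio."""
    target_minority = int(ratio * n_majority)
    return max(0, target_minority - n_minority)

def ratio_for(target_minority, n_majority):
    """Ratio that would yield exactly `target_minority` minority samples."""
    return target_minority / n_majority

# 100 majority records and 10 minority records, as in the question.
extra_at_1 = synthetic_needed(100, 10, 1.0)      # minority grows to 100
extra_at_half = synthetic_needed(100, 10, 0.5)   # minority grows to 50
ratio_for_70 = ratio_for(70, 100)                # ratio to end with 70 minority rows
```

Under this reading, choosing a specific minority count amounts to picking the ratio `target_minority / n_majority`.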
0
votes
1 answer

Oversampling with Leave One Out Cross Validation

I am working with an extremely unbalanced dataset with a total of 44 samples for my research project. It is a binary classification problem with 3/44 samples of the minority class for which I am using Leave One Out Cross Validation. If I perform…
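
One common way to combine the two, sketched under the assumption that oversampling should only ever see the training portion of each fold: hold one sample out, then randomly oversample the remaining samples, so the held-out sample is never duplicated into the training data. The generator name is hypothetical, and random replication stands in for whatever oversampler is actually used.

```python
import random

def loo_folds_with_oversampling(labels, seed=0):
    """Yield (held_out, train_idx) with oversampling applied to train only."""
    rng = random.Random(seed)
    n = len(labels)
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]

        # Class sizes inside this fold's training portion.
        counts = {}
        for i in train:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        majority_size = max(counts.values())

        # Replicate each smaller class up to the majority size.
        resampled = list(train)
        for y, c in counts.items():
            pool = [i for i in train if labels[i] == y]
            resampled.extend(rng.choices(pool, k=majority_size - c))
        yield held_out, resampled

labels = [1] * 3 + [0] * 41                  # 3 minority samples out of 44
folds = list(loo_folds_with_oversampling(labels))
```

Oversampling before splitting would let copies of the held-out sample leak into training, which tends to inflate the cross-validation score.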
0
votes
0 answers

Using SMOTE on training data

I have an unbalanced dataset and I want to use SMOTE. I am working with Azure ML. I have read many examples on the Microsoft documentation page. I am wondering why SMOTE is placed before the SPLIT DATA function and not after SPLIT DATA, on the 70%…
Mutatos
0
votes
1 answer

Unbalanced dataset resulting in high false positives after using SMOTE

I am working on an imbalanced binary-classification marketing dataset with a No:Yes ratio of 88:12 (No = didn't buy the product, Yes = bought), ~4,300 observations, and 30 features (9 numeric and 21 categorical). I divided my data into train (80%) &…