Questions tagged [oversampling]

Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).
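
As a rough illustration of the two ideas in the tag description, here is a minimal standard-library sketch. The function names `random_oversample` and `random_undersample` are made up for this example and do not come from any library. It covers only the simplest random variants, not SMOTE or ROSE.

```python
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 8 rows of class 0 and 2 rows of class 1.
rows = list(range(10))
labels = [0] * 8 + [1] * 2
print(Counter(random_oversample(rows, labels)[1]))   # both classes grow to 8 rows
print(Counter(random_undersample(rows, labels)[1]))  # both classes shrink to 2 rows
```

Oversampling keeps every original row and adds duplicates, while undersampling discards rows, so the two change the class ratio in opposite directions.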

156 questions
0
votes
0 answers

Generating Artificial data from real data

I have a dataframe consisting of 2000 rows and 5 features (columns) as follows: my_data: Id, f1, f2, f3, f4(target_value) u1 34 sd 43 1 u1 30 fd 3 0 u1 01 …
Spedo
  • 355
  • 3
  • 13
0
votes
1 answer

Function for cross validation and oversampling (SMOTE)

I wrote the code below. X is a dataframe with the shape (1000,5) and y is a dataframe with shape (1000,1). y is the target data to predict, and it is imbalanced. I want to apply cross validation and SMOTE. def Learning(n, est, X, y): s_k_fold =…
BTurkeli
  • 91
  • 1
  • 2
  • 15
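
A common concern when combining cross validation with oversampling is that the resampling should touch only the training part of each fold, so the validation part keeps the original class ratio. The sketch below shows that ordering with the standard library only. It is not the asker's `Learning` function, it uses plain random duplication in place of SMOTE, and the helper names (`stratified_folds`, `oversample_indices`, `cv_with_oversampling`) are hypothetical.

```python
import random
from collections import Counter


def stratified_folds(labels, n_splits, seed=0):
    """Deal the indices of each class round-robin into n_splits folds."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def oversample_indices(indices, labels, rng):
    """Repeat minority-class indices at random until all classes are equally frequent."""
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(members) for members in by_class.values())
    result = []
    for members in by_class.values():
        result.extend(members)
        result.extend(rng.choice(members) for _ in range(target - len(members)))
    return result


def cv_with_oversampling(labels, n_splits=5, seed=0):
    """Yield (train_indices, validation_indices); only the training side is resampled."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, n_splits, seed)
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield oversample_indices(train_idx, labels, rng), val_idx


# Toy labels (hypothetical): 90 majority samples and 10 minority samples.
labels = [0] * 90 + [1] * 10
for train_idx, val_idx in cv_with_oversampling(labels, n_splits=5):
    train_counts = Counter(labels[i] for i in train_idx)
    val_counts = Counter(labels[i] for i in val_idx)
    print(dict(train_counts), dict(val_counts))
```

With imbalanced-learn, the usual way to get the same ordering is to place the sampler inside an `imblearn.pipeline.Pipeline` and pass that pipeline to the cross-validation routine.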
0
votes
1 answer

ML with imbalanced binary dataset

I have a problem I am trying to solve: - imbalanced dataset with 2 classes - one class dwarfs the other one (923 vs 38) - f1_macro score when the dataset is used as-is to train RandomForestClassifier stays for TRAIN and TEST in 0.6 - 0.65…
Greem666
  • 919
  • 13
  • 24
0
votes
1 answer

Error in ROSE sampling when balancing data with categorical variables

I'm trying to balance my data, in which the majority class has a proportion of 99% while the rare class has 1%. My response variable is binary and my independent variables are a mix of binary, integer, and categorical variables. I'm using the ROSE function of…
Cigdem
  • 1
  • 2
0
votes
0 answers

Does multicore processing in IPython kernel Jupyter Notebook really speed up execution time?

I'm running dataset oversampling code in a Python 3 Jupyter notebook. Snippet: sm = SVMSMOTE(random_state=42) X_res, Y_res = sm.fit_resample(X,Y) But this is taking too long to execute. When I checked the system monitor, it showed that only one…
0
votes
1 answer

What is the best way to oversample a dataframe preserving its statistical properties in Python 3?

I have the following toy df: FilterSystemO2Concentration (Percentage) ProcessChamberHumidityAbsolute (g/m3) ProcessChamberPressure (mbar) 0 0.156 1 29.5 …
Miguel 2488
  • 1,410
  • 1
  • 20
  • 41
0
votes
1 answer

SMOTE Algorithm and Classification: overrated prediction success

I'm facing a problem about which I can't find any answer. I have a binary classification problem (output Y=0 or Y=1) with Y=1 the minority class (actually Y=1 indicates default of a company, with proportion=0.02 in the original…
T. Ciffréo
  • 126
  • 10
0
votes
0 answers

Error in Oversampling example in R

I am running the code below for oversampling in R: varNames1 = paste0("Quote.Type","+","Quote.State","+","Forecast.Type","+","Suggested.Reseller.Discount","+","Territory","+","Pricing.Type") ctrl <- trainControl(method = "repeatedcv", …
0
votes
0 answers

Could not balance a large dataset

I tried various techniques such as oversampling, undersampling, ROSE, and both (oversampling and undersampling) to balance an imbalanced dataset. When I applied all these techniques on a small dataset then these techniques perfectly…
maira khan
  • 43
  • 1
  • 8
0
votes
0 answers

Any reason not to use oversampling and undersampling together?

This has been bothering me for quite some time. If oversampling and undersampling both have their pros and cons, why not use them together to minimize their weaknesses? I just couldn't find a paper or an article that says they've used both or we…
user8397275
  • 131
  • 1
  • 8
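
To make concrete what "using them together" could look like, here is one possible sketch: every class is resampled to a common intermediate size, so large classes are cut down and small classes are padded with duplicates. The function `resample_to_target` and the target value are assumptions for illustration only, and the sketch makes no claim about whether this helps a given model.

```python
import random
from collections import Counter


def resample_to_target(rows, labels, target, seed=0):
    """Undersample classes larger than target and oversample classes smaller than it."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        if len(members) >= target:
            # Undersampling: keep a random subset without replacement.
            chosen = rng.sample(members, target)
        else:
            # Oversampling: keep all rows and add random duplicates.
            extra = [rng.choice(members) for _ in range(target - len(members))]
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (hypothetical): 100 majority rows, 10 minority rows, target of 40 per class.
rows = list(range(110))
labels = [0] * 100 + [1] * 10
new_rows, new_labels = resample_to_target(rows, labels, target=40)
print(Counter(new_labels))
```

Choosing a target between the two class sizes discards fewer majority rows than pure undersampling and creates fewer duplicates than pure oversampling.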
0
votes
0 answers

Oversampling doesn't generate new samples

My dataset has the following distribution (class: frequency): 0: 960, 1: 2093, 2: 22696, 3: 1116, 4: 2541, 5: 1298, 6: 14. I am using python-imblearn to oversample the minority class. With regular SMOTE I am…
0
votes
2 answers

What is the difference between Stratify and StratifiedKFold in Python scikit-learn?

My data is 99% target variable = 1 and 1% target variable = '0'. Does stratify guarantee that the train and test sets have an equal ratio of data in terms of the target variable? As in, do they contain equal amounts of '1' and '0'? Please see…
user9238790
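
For context on what stratification does: it preserves the original class proportions in each split, and it does not balance the classes to equal counts. The standard-library sketch below mimics that behavior for a single train/test split. `stratified_split` is a hypothetical helper, not scikit-learn's implementation.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction, seed=0):
    """Split indices so that each class contributes test_fraction of its members to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy labels (hypothetical) with the 99% / 1% ratio from the question.
labels = [1] * 990 + [0] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(Counter(labels[i] for i in train_idx))  # still about 99% ones
print(Counter(labels[i] for i in test_idx))   # still about 99% ones
```

In scikit-learn, `train_test_split(..., stratify=y)` applies this idea to one split, while `StratifiedKFold` applies it to each of the k folds.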
-1
votes
1 answer

How to get indices of created samples in Imblearn

I am using different imblearn over-sampling methods on a dataset which contains ~55800 samples. About 200 are class 1, the rest class 0. I am oversampling class 1 with various over-sampling strategies. It does not improve my model quality and…
Andreas bleYel
  • 463
  • 2
  • 5
  • 7
-1
votes
2 answers

Create row of most frequent value in each dataframe column

CONTEXT I want to create a top row with the most frequent values of each column. CURRENT CODE df = df.loc[df['Gender'] == 'M'] df = df('Gender').count() DATA SAMPLE Gender Eyes Hair Height M Brown Brown >6ft M …
KL_
  • 293
  • 6
  • 22
-1
votes
1 answer

What is the correct way of sampling a highly imbalanced dataset which has low between-feature correlation and low between-class variance?

I have a dataset with 23 features with very low correlation. There is low variance between the two classes. The classes are highly imbalanced, like the data available for fraud detection. What is a suitable approach for sampling this kind…