
I am new to R. I recently used stratified sampling for the train/test split to ensure the target label is in equal proportion in both sets. Now I want to down-sample the training data such that the class distribution of the down-sampled data matches the distribution of the full training data (i.e. the population distribution).

The reason I want to down-sample is that I have 11 million rows with 56 columns, and parameter tuning via grid/random/Bayesian search would take days.

I am using XGBoost and it is a binary classification problem.

I would really appreciate it if someone could help me with this.

Below is my code

    library(caTools)  ## provides sample.split
    train_rows = sample.split(df$ModelLabel, SplitRatio = 0.7)  ## stratified sampling
    train = df[train_rows, ]
    test  = df[!train_rows, ]
Dexter1611

1 Answer


The easiest way to achieve this is to calculate the ratio between the two classes. Say that out of 11 million rows there are 3 million 0's and 8 million 1's, so your 0:1 ratio is 3:8. Now, if you want to down-sample to 1 million rows, you randomly select 1 million rows while maintaining that same 3:8 ratio: approximately 270,000 class-0 samples and 730,000 class-1 samples (exactly 1,000,000 × 3/11 ≈ 272,727 and 1,000,000 × 8/11 ≈ 727,273). You can then use pandas' `DataFrame.sample()` to get the down-sampled data. Here is Python code for the same.

import pandas as pd

df_class_0 = df[df.target == 0]
df_class_1 = df[df.target == 1]
df_class_0_under = df_class_0.sample(n=272727, random_state=42)  # ~ 1M * 3/11
df_class_1_under = df_class_1.sample(n=727273, random_state=42)  # ~ 1M * 8/11
df_test_under = pd.concat([df_class_0_under, df_class_1_under], axis=0)
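Instead of hand-computing the per-class counts, you can let pandas do it: sampling the same *fraction* within each class automatically preserves the original class ratio. A minimal sketch (the column name `target` and the helper name are hypothetical; adapt them to your data):

```python
import pandas as pd

def downsample_preserving_ratio(df, label_col, n_total, seed=42):
    # Fraction of the full data to keep overall.
    frac = n_total / len(df)
    # Sampling the same fraction inside each class keeps the class
    # ratio of the result equal to the ratio in the full data.
    return (
        df.groupby(label_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=seed))
    )

# Toy data: 30 rows of class 0 and 80 rows of class 1 (a 3:8 ratio).
df = pd.DataFrame({"target": [0] * 30 + [1] * 80})
# Down-sample to ~11 rows: 3 of class 0 and 8 of class 1.
small = downsample_preserving_ratio(df, "target", n_total=11)
```

On the 11-million-row case, `downsample_preserving_ratio(df, "ModelLabel", n_total=1_000_000)` would give the same ~272,727 / ~727,273 split described above without manual arithmetic.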
Vatsal Gupta