I have a banking dataset in which 5% of the records are defaulters and the rest are good (non-defaulters).

I want to create a sample with 30% defaulters and 70% non-defaulters.

Assuming my dataset is data and it has a column named "default" containing 0 or 1, how do I get a sample with 30% default and 70% non-default, given that my original dataset has only 5% default?

Could someone please provide the R code? That would be great. I tried the following to draw 100 random rows with replacement:

data[sample(1:nrow(data),size=100,replace=TRUE),]

But how do I ensure that the split is 30%/70%?

justintime

2 Answers

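sample has a prob argument: a vector of probability weights for the elements of the vector being sampled. So you could pass prob=c(0.3,0.7) to sample. The weights are matched to the elements in order, so with 0:1 and prob=c(0.3,0.7), 0 is drawn with probability 0.3 and 1 with probability 0.7; swap the weights if you want 30% ones.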

For example

sample(0:1, 100, replace=TRUE, prob=c(0.3,0.7))
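The call above only draws class labels, not rows. One way to turn it into a row sample is a two-stage draw, sketched below: first draw a label for each slot, then pick a row from the matching class. The data frame here is made up for illustration (1000 rows, roughly 5% defaulters), and the name smp is just a placeholder.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
# Hypothetical data: 1000 rows, roughly 5% defaulters
data <- data.frame(default = rbinom(1000, 1, 0.05), y = rnorm(1000))

n <- 100
# Stage 1: draw a class label for each of the n slots (1 with prob 0.3, 0 with prob 0.7)
labels <- sample(c(1, 0), n, replace = TRUE, prob = c(0.3, 0.7))

# Stage 2: draw row indices, with replacement, from the matching class
idx_default <- which(data$default == 1)
idx_good    <- which(data$default == 0)
rows <- c(
  idx_default[sample.int(length(idx_default), sum(labels == 1), replace = TRUE)],
  idx_good[sample.int(length(idx_good), sum(labels == 0), replace = TRUE)]
)

smp <- data[rows, ]
prop.table(table(smp$default))  # close to 0.7 / 0.3, but not exact
```

The split is 30/70 only in expectation, since the number of defaulters in the sample is itself random.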
iugrina

Assume df is your dataframe and default is the column indicating who defaults.

To sample without replacement:

df[c(sample(which(df$default == 1), 30), sample(which(df$default == 0), 70)), ]

To sample with replacement (i.e., possibly duplicating records):

df[c(sample(which(df$default == 1), 30, TRUE), sample(which(df$default == 0), 70, TRUE)), ]
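If you want the same exact split for an arbitrary sample size, the two one-liners can be wrapped in a small helper. This is only a sketch: stratified_sample and its arguments are names I made up, the data frame is simulated, and it assumes default is coded 0/1.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
# Hypothetical data: 2000 rows, roughly 5% defaulters
df <- data.frame(default = rbinom(2000, 1, 0.05), y = rnorm(2000))

# Hypothetical helper: draw n rows with a fixed share of defaulters
stratified_sample <- function(df, n, p_default = 0.3, replace = FALSE) {
  idx1 <- which(df$default == 1)  # defaulters
  idx0 <- which(df$default == 0)  # non-defaulters
  n1 <- round(n * p_default)      # number of defaulters to draw
  n0 <- n - n1                    # number of non-defaulters to draw
  if (!replace && (n1 > length(idx1) || n0 > length(idx0))) {
    stop("not enough rows in one class; use replace = TRUE or a smaller n")
  }
  # Index via sample.int so a class with a single row is handled correctly
  rows <- c(idx1[sample.int(length(idx1), n1, replace = replace)],
            idx0[sample.int(length(idx0), n0, replace = replace)])
  df[rows, ]
}

s <- stratified_sample(df, n = 200)
table(s$default)  # 140 non-defaulters and 60 defaulters by construction
```

Without replacement, the largest possible sample is limited by the number of defaulters (about number of defaulters / 0.3), which is why the helper stops early when a class is too small.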

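Alternatively, if you don't want to specify an exact number of defaulters and non-defaulters, you can specify a sampling weight for each row. Note that the weights apply per row, so the expected split also depends on how many rows each class has; the example below uses roughly balanced classes.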

set.seed(1)
df <- data.frame(default=rbinom(250,1,.5), y=rnorm(250))

n <- 100 # could be any number, but the closer n gets to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
#  0  1 
# 61 39 

n <- 150 # could be any number, but the closer n gets to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
#  0  1 
# 97 53
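With unbalanced data like the 5% case in the question, per-row weights of .3 and .7 would not give a 30/70 split, because there are far more non-defaulter rows sharing the .7. A possible adjustment, sketched here on simulated data, is to divide each class's target share by the number of rows in that class. I sample with replacement so that every draw has the same class probabilities.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
# Hypothetical data: 5000 rows, roughly 5% defaulters
df <- data.frame(default = rbinom(5000, 1, 0.05), y = rnorm(5000))

n1 <- sum(df$default == 1)  # number of defaulters
n0 <- sum(df$default == 0)  # number of non-defaulters

# Each class's rows share that class's target mass (0.3 or 0.7)
w <- ifelse(df$default == 1, 0.3 / n1, 0.7 / n0)

# With replacement, each draw is a defaulter with probability 0.3
s <- sample(nrow(df), 1000, replace = TRUE, prob = w)
prop.table(table(df$default[s]))  # roughly 0.7 / 0.3
```

Without replacement the defaulters get used up as the sample grows, so the share drifts away from 30% for larger n.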
Thomas