
I'm struggling to create a vectorized, functional solution for repeating stratified random sampling without replacement over many iterations. I can sample without replacement once, remove those rows from the dataset, and then repeat the process on the unsampled observations. Unfortunately, I need to do this many times, which makes the manual approach impractical.

I've tried using the replicate() function, but it only samples without replacement within each pass: the chosen rows go back into the dataset before the next draw.

Using the code below, I'd like the function to create 30 new datasets, each composed of 3 unique (previously unsampled) rows from the "one" set and 3 from the "zero" set. So each new dataset would have 6 total observations (three 1s and three 0s) and a unique name (e.g. "new_dat1", "new_dat2", ..., "new_dat30").

If possible, I'm looking to achieve all of this without using for loops, so something in the "apply" family is preferred.

set.seed(123)
dat <- data.frame(Outcome = round(runif(160, 0, 1)))
cust <- data.frame(Cust = rep(c("ABC", "DEF", "GHI"), c(45, 80, 35)))
dat <- cbind(cust, dat)

one <- subset(dat, Outcome == 1)
zero <- subset(dat, Outcome == 0)


# Manual option which is not sufficient    
################################################
# sample 1's and remove chosen obs from "one" dataset
set.seed(123)
index <- sample(1:nrow(one), 3, replace = FALSE)
new_dat1 <- one[index, ]
unused_one <- one[-index, ]

# sample 0's and remove chosen obs from "zero" dataset
set.seed(123)
index <- sample(1:nrow(zero), 3, replace = FALSE)
unused_zero <- zero[-index, ]

# combine the 3-1 and 3-0 samples into the first of 30 "new_datn" sets
new_dat1 <- rbind(new_dat1, zero[index, ])

# repeat, now sampling from "unused_one" and "unused_zero" to create "new_dat2" - "new_dat30"
################################################


# Failed attempt using the replicate() function
################################################
set.seed(123)
one_sample <- replicate(30, one[sample(nrow(one), 3, replace = FALSE), ], simplify = FALSE)
zero_sample <- replicate(30, zero[sample(nrow(zero), 3, replace = FALSE), ], simplify = FALSE)

Making this even more complicated is the fact that my total number of 0 and 1 observations in the "dat" set will vary from time to time so I'll likely always have remainders to deal with. So the function must be able to sample 3 for each "new_dat" until it runs into a remainder for the final set, which can go into the final "new_dat" regardless of the value.

Even if I could figure out how to solve the sampling issue in a vectorized function, I would really be at a loss to have the function create new datasets and name them appropriately.

I would be very grateful if anyone could provide me with some assistance. Thank you for taking the time to read through my post.

Brian

1 Answer


If I understood what you want correctly, here is one solution.

First just sample the whole vector, that is, you are just going to randomly sort the row numbers:

sample_rows  <- sample(nrow(one))

Then assign a sample group for each of the randomly distributed rows (3 elements by group). Since the number of elements may not be divisible by 3, extend the length of the vector so it has the same length as the number of rows. Now fill the NA's with the next group (I think that is what you meant by "remainder for the final set"):

sample_group <- rep(1:(length(sample_rows)%/% 3), each = 3)
length(sample_group) <- length(sample_rows)
sample_group[is.na(sample_group)] <- max(sample_group, na.rm = TRUE) + 1

So now you have 24 groups of 3 rows and 1 group of 2 rows, all drawn without replacement:

samples <- data.frame(sample_rows, sample_group)

head(samples)
  sample_rows sample_group
1          12            1
2           6            1
3          41            1
4          35            2
5          71            2
6          62            2

tail(samples)
   sample_rows sample_group
69          69           23
70          53           24
71          32           24
72          27           24
73          18           25
74          65           25
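As a quick check of the grouping step, here is a minimal sketch on a toy vector of 8 rows rather than the real data. It compares the three-line construction above with a one-line integer-division form. The last variant, which folds the remainder into the final full group with `pmin()`, is only one possible reading of "remainder for the final set" and is an assumption, not part of the original approach.

```r
# Toy example: 8 shuffled row numbers cut into groups of 3
set.seed(123)
n <- 8
sample_rows <- sample(n)

# Construction used above: full groups first, then pad and fill the remainder
g1 <- rep(1:(n %/% 3), each = 3)
length(g1) <- n
g1[is.na(g1)] <- max(g1, na.rm = TRUE) + 1

# Equivalent one-liner: integer division on the position
g2 <- (seq_len(n) - 1) %/% 3 + 1

all(g1 == g2)   # both constructions give the same labels
table(g2)       # group sizes: 3, 3, 2

# Assumption: merge the leftover rows into the last full group instead
g3 <- pmin(g2, n %/% 3)
table(g3)       # group sizes: 3, 5
```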

I did this for the vector of "one"s but you can easily replicate that for the vector of zeros and combine them.
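One way that combination might look, as a rough sketch: the helper name `split_rows` is hypothetical, and pairing the groups by position while dropping the extra groups from the longer list is an assumption about what is wanted, not something stated in the question.

```r
# Rebuild the example data from the question
set.seed(123)
dat <- data.frame(
  Cust    = rep(c("ABC", "DEF", "GHI"), c(45, 80, 35)),
  Outcome = round(runif(160, 0, 1))
)

# Hypothetical helper: shuffle the rows of df, then cut them into
# consecutive chunks of `size` rows (the last chunk may be smaller)
split_rows <- function(df, size = 3) {
  shuffled <- df[sample(nrow(df)), , drop = FALSE]
  group <- (seq_len(nrow(shuffled)) - 1) %/% size + 1
  split(shuffled, group)
}

one_groups  <- split_rows(subset(dat, Outcome == 1))
zero_groups <- split_rows(subset(dat, Outcome == 0))

# Pair the i-th "one" chunk with the i-th "zero" chunk; Map() would recycle
# the shorter list, so both are truncated to the same length first
n_sets  <- min(length(one_groups), length(zero_groups))
new_dat <- Map(rbind, one_groups[seq_len(n_sets)], zero_groups[seq_len(n_sets)])
```

Because each row is shuffled once and assigned to exactly one chunk, no row can appear in two of the resulting data.frames.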

PS: You can extract the rows from the data.frame using split() and lapply(). For example:

new_dat <- lapply(split(samples$sample_rows, samples$sample_group), function(x) one[x,])

So new_dat is a list with all 25 data.frames. For example:

new_dat[[8]] # gives you the eighth data.frame

Or:

new_dat[[25]] # gives you the last data.frame
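If names like "new_dat1", "new_dat2", ... are wanted, one option is to name the list elements instead of creating separate objects. The sketch below uses a small stand-in list (a 14-row toy data.frame, which is an assumption for illustration) so that it runs on its own:

```r
# Toy stand-in for the list of data.frames: 14 rows in chunks of 3
new_dat <- split(data.frame(id = 1:14), (0:13) %/% 3 + 1)

# Name each element "new_dat1", "new_dat2", ...
names(new_dat) <- paste0("new_dat", seq_along(new_dat))

new_dat$new_dat2   # access a data.frame by name

# Optional: copy every element into the global environment as its own object
# list2env(new_dat, envir = globalenv())
```

Keeping the data.frames in a named list usually makes later `lapply()` calls simpler than working with 30 separate objects.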
Carlos Cinelli
  • I think your `sample_group` creation can be simplified with something like `0:(nrow(one)-1) %/% 3`, but +1 otherwise :-) – A5C1D2H2I1M1N2O1R2T1 Apr 23 '15 at 05:14
  • Sorry if my question was unclear. Essentially all I'm trying to do is start with one dataset made up of zero and one observations in a column, and I want to then create 30 new datasets made up of 3 random zero observations and 3 random one observations (6 total rows). The remainder comment meant that the total number of zeros and ones will not always be divisible by 3. In my example above, there were 86 zeros to start with which means we should really be putting 2.86 zeros in each new set. @Carlos Cinelli – Brian Apr 23 '15 at 13:32
  • Thank you both for your comments. It's simpler than I was expecting, but I believe it will work if there's a way to automate the next step. Now that each observation is assigned to a sample group, is there a way to automate pulling them out by "sample_group" number and creating a new dataset for each group number? So "new_dat1", "new_dat2", ..., "new_dat30"? Thanks! @CarlosCinelli – Brian Apr 23 '15 at 13:42
  • @Brian yes, you can pull them out, one way to do it with one line is by using `split()` and `lapply()`. Then you will have a list with all the 25 data.frames. `lapply(split(samples$sample_rows, samples$sample_group), function(x) one[x,])`. I will add this to the answer. – Carlos Cinelli Apr 23 '15 at 13:52
  • @CarlosCinelli Perfect! This will do the trick. Thank you for taking the time to help me find an answer. I appreciate it. I'm new to Stack Overflow so unfortunately I can't upvote your solution, but thank you just the same. – Brian Apr 23 '15 at 14:03