
I have response data by market in the format:

head(df)
    ID  market  q1  q2
    470 France  1   3
    625 Germany 0   2
    155 Italy   1   6
    648 Spain   0   5
    862 France  1   7
    699 Germany 0   8
    460 Italy   1   6
    333 Spain   1   5
    776 Spain   1   4

and the following frequencies:

 table(df$market)
    France  140
    Germany 300
    Italy   50
    Spain   75

I need to create a data frame with a sample of 100 responses per market, drawn without replacement, taking all responses for any market that has fewer than 100.

so

table(df_new$market)
        France  100
        Germany 100
        Italy   50
        Spain   75

Thanks in advance!

  • Have you tried to get subsets of data for each group individually and combine them together? For example, you have less than 100 samples for the group, take all of them, if you have more than 100 samples, randomly select 100 from the sample. – TYZ Apr 02 '14 at 18:06
  • yes, this makes sense for a smaller group. When the number of factors in the group is large, I was hoping to have a function have the check built in. – user2091904 Apr 02 '14 at 18:09
  • Write a for loop with 2 conditions: if the group size is less than 100, take all, if the group size is greater than 100, get a subset of it. – TYZ Apr 02 '14 at 18:11
  • You can also use raking if you want to weight the responses differently. The `survey` package has a rake function that does this. I can write out the methodology more in detail if you want. – Max Candocia Apr 02 '14 at 18:18
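The loop approach suggested in the comments can be sketched as follows. The `df` here is a synthetic stand-in built to match the market frequencies in the question (the real data would be used instead); `min(100, length(rows))` collapses the two conditions into one expression:

```r
# Synthetic stand-in for the question's df, with the question's market sizes
set.seed(1)
df <- data.frame(ID     = sample(1000, 565),
                 market = rep(c("France", "Germany", "Italy", "Spain"),
                              times = c(140, 300, 50, 75)),
                 q1     = sample(0:1, 565, replace = TRUE))

# For each market: sample 100 rows without replacement,
# or take all rows when the market has fewer than 100
df_new <- NULL
for (m in unique(df$market)) {
  rows <- which(df$market == m)
  take <- sample(rows, min(100, length(rows)))
  df_new <- rbind(df_new, df[take, ])
}

table(df_new$market)
#  France Germany   Italy   Spain 
#     100     100      50      75 
```

The counts are deterministic (only which rows are picked varies with the seed), so the resulting table always matches the desired output in the question.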

1 Answer


The following looks valid:

# Example data: 25 rows grouped by c1
set.seed(10); DF = data.frame(c1 = sample(LETTERS[1:4], 25, TRUE), c2 = runif(25))
# Per-group sample size: at most 5, or the whole group when it is smaller
freqs = as.data.frame(table(DF$c1))
freqs$ss = ifelse(freqs$Freq >= 5, 5, freqs$Freq)
#> freqs
#  Var1 Freq ss
#1    A    4  4
#2    B   11  5
#3    C    7  5
#4    D    3  3
# Sample the computed number of rows (without replacement) from each group
res = mapply(function(x, y) DF[sample(which(DF$c1 %in% x), y), ], 
             x = freqs$Var1, y = freqs$ss, SIMPLIFY = FALSE)
do.call(rbind, res)
#   c1        c2
#5   A 0.3558977
#17  A 0.2289039
#6   A 0.5355970
#13  A 0.9546536
#3   B 0.2395891
#25  B 0.8015470
#10  B 0.4226376
#15  B 0.5005032
#19  B 0.7289646
#11  C 0.7477465
#9   C 0.8998325
#12  C 0.8226526
#1   C 0.7066469
#4   C 0.7707715
#23  D 0.4861003
#20  D 0.2498805
#21  D 0.1611833
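Applied to the question's data, the same pattern with a cap of 100 might look like the sketch below. The `df` is a synthetic stand-in matching the market frequencies from the question, and `pmin` replaces the `ifelse` for the size cap:

```r
# Synthetic stand-in for the question's df, with the question's market sizes
set.seed(42)
df <- data.frame(ID     = sample(1000, 565),
                 market = rep(c("France", "Germany", "Italy", "Spain"),
                              times = c(140, 300, 50, 75)),
                 q1     = sample(0:1, 565, replace = TRUE),
                 q2     = sample(1:8, 565, replace = TRUE))

# Per-market sample size: at most 100, or the whole market when smaller
freqs <- as.data.frame(table(df$market))
freqs$ss <- pmin(freqs$Freq, 100)

# Sample that many rows (without replacement) from each market
res <- mapply(function(x, y) df[sample(which(df$market %in% x), y), ],
              x = freqs$Var1, y = freqs$ss, SIMPLIFY = FALSE)
df_new <- do.call(rbind, res)

table(df_new$market)
#  France Germany   Italy   Spain 
#     100     100      50      75 
```

This reproduces the desired `table(df_new$market)` counts from the question regardless of the seed, since the per-market sample sizes are fixed by `pmin`.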
alexis_laz