Stratified sampling with constraints

Question

I'm new to R, so please bear with me.

I'm trying to perform stratified sampling using two columns as the strata, with both columns restricted to specific values.

This is my code:

```
library(splitstackshape)
set.seed(1)
dat1 <- data.frame(ID = 1:100,
                   A = sample(c("AA", "BB", "CC", "DD", "EE"), 100, replace = TRUE),
                   B = sample(c(30, 40, 50), 100, replace = TRUE),
                   C = sample(c(1:10), 100, replace = TRUE),
                   D = sample(c("CA", "NY", "TX"), 100, replace = TRUE),
                   E = sample(c("M", "F"), 100, replace = TRUE))

stratified(dat1, c("B", "C"), 0.1, select = list(B = 30, C = c(8:10)))
```

To my understanding, this function first generates a sample of size 10% and then selects from it the records that satisfy the condition B = 30 and C between 8 and 10.

As a result, the sample ends up smaller than the initial 10%.

My question: is there a way to generate a sample consisting only of records where column B is 30 and column C is between 8 and 10, with the nrow() of the resulting sample being 10% of the original data frame?

I'm using stratified() from "splitstackshape". If stratified() cannot handle this, are there any other packages out there that can perform this kind of operation?

Please use `set.seed` when using such functions as `sample` to ensure reproducibility. — Sotos, Sep 07 '17 at 13:36

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer · 2017-09-08T11:33:45.023

Update

Continuing from the sample data in the original answer, I would use a two-step process:

Create a subset with the levels you're interested in.

```
sub1 <- as.data.table(dat1)[B == 30 & C %in% 8:10][order(C)]
```

Figure out what percentage you need to sample. Here, I've set the final number of rows to 500, since the sample data doesn't have 1000 rows when a subset is taken. To get the required percentage, it's as simple as the desired number of rows divided by the total number of rows in the subset...
```
rows_wanted <- 500
set.seed(2)
out <- stratified(sub1, "C", rows_wanted/nrow(sub1))

## Check how many rows we have per group
out[, .N, .(B, C)]
#     B  C   N
# 1: 30  8 157
# 2: 30  9 169
# 3: 30 10 174
```
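If the total has to be exact no matter how the per-group rounding falls, the same two-step idea can be sketched in base R. This is only an illustration under my own assumptions: `sample_fixed_total()` is a hypothetical helper, not part of splitstackshape, and it uses largest-remainder rounding to spread the requested total across the strata in proportion to their size. The toy data below mirrors the columns used in this thread.

```r
# Hypothetical helper (not from splitstackshape): draw exactly `total` rows
# from the data frame `df`, allocated across the levels of `strata_col` in
# proportion to their size (largest-remainder rounding), without replacement.
sample_fixed_total <- function(df, strata_col, total) {
  stopifnot(total <= nrow(df))
  groups <- split(seq_len(nrow(df)), df[[strata_col]], drop = TRUE)
  sizes  <- lengths(groups)
  exact  <- total * sizes / sum(sizes)   # ideal, possibly fractional, allocation
  alloc  <- floor(exact)
  # Hand the leftover rows to the strata with the largest fractional parts
  short  <- total - sum(alloc)
  if (short > 0) {
    extra <- order(exact - alloc, decreasing = TRUE)[seq_len(short)]
    alloc[extra] <- alloc[extra] + 1
  }
  # Sample row positions within each stratum, then return the rows in order
  picked <- unlist(Map(function(idx, k) idx[sample.int(length(idx), k)],
                       groups, alloc), use.names = FALSE)
  df[sort(picked), , drop = FALSE]
}

# Usage on toy data shaped like the example in this thread
set.seed(1)
n   <- 10000
dat <- data.frame(ID = seq_len(n),
                  B  = sample(c(30, 40, 50), n, replace = TRUE),
                  C  = sample(1:10, n, replace = TRUE))
sub <- dat[dat$B == 30 & dat$C %in% 8:10, ]   # step 1: keep the levels of interest
out <- sample_fixed_total(sub, "C", 500)      # step 2: sample a fixed total
nrow(out)
table(out$C)
```

The `stopifnot()` at the top makes the helper fail early when the subset is smaller than the requested total, which is the situation in the original 100-row example.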

Original answer

The stratified function filters the data first, and then does the sampling. Consider the following:

```
library(splitstackshape)
library(data.table)  # for as.data.table(), used below
set.seed(1)
n <- 10000
dat1 <- data.frame(ID = sequence(n),
                   A = sample(c("AA", "BB", "CC", "DD", "EE"), n, replace = TRUE),
                   B = sample(c(30, 40, 50), n, replace = TRUE),
                   C = sample(c(1:10), n, replace = TRUE),
                   D = sample(c("CA", "NY", "TX"), n, replace = TRUE),
                   E = sample(c("M", "F"), n, replace = TRUE))
```

Sample, as you've shown.

```
mySample <- stratified(dat1, c("B", "C"), 0.1, select = list(B = 30, C = 8:10))
nrow(mySample)
# [1] 98
```

Compare that to how many rows you should expect in the output:

```
as.data.table(dat1)[, .N, .(B, C)][B == 30 & C %in% 8:10, list(N = round(N * .1)), .(B, C)][order(C)]
#     B  C  N
# 1: 30  8 31
# 2: 30  9 33
# 3: 30 10 34
```

And compare the above to what you get from the stratified function.

```
mySample[, .N, .(B, C)]
#     B  C  N
# 1: 30  8 31
# 2: 30  9 33
# 3: 30 10 34
```

I'm sorry, correct me if I'm wrong, but in your code the original data frame has 10000 rows. When you sample 10% of it along with the column constraints, the resulting sample has an nrow() of 98 (which was my initial issue), which is not 10% of the initial data frame (1000 in this case). Allow me to rephrase my question: I'm trying to find a way to do stratified sampling with column constraints where the size of the sample stays at the given 10%. I hope I was able to make myself understood. – Marek, Sep 08 '17 at 06:21
@Rowen, please see my update and let me know if that's along the lines of what you're trying to do. – A5C1D2H2I1M1N2O1R2T1, Sep 08 '17 at 11:34

Rui Barradas · Answer 2 · 2020-09-27T08:41:35.343


With your data this doesn't seem to be possible, at least not if you are sampling without replacement.

```
idx <- which((dat1$B == 30) & (dat1$C %in% 8:10))
idx <- sample(idx, 0.1*nrow(dat1))
# Error in sample.int(length(x), size, replace, prob) :
#   cannot take a sample larger than the population when 'replace = FALSE'
```

The problem is that the number of rows that satisfy the two conditions is less than 10% of your data. The vector idx has a length of only 5.

```
idx
#[1] 15 18 43 60 93

dat1[idx, ]
#   ID  A  B  C  D E
#15 15 DD 30  9 CA F
#18 18 EE 30 10 NY M
#43 43 DD 30 10 NY F
#60 60 CC 30 10 NY M
#93 93 DD 30 10 TX M
```
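One way to avoid hitting that error is to compare the number of eligible rows with the target before sampling. The sketch below is base R only; `sample_eligible()` is a hypothetical helper of mine, and falling back to sampling with replacement is just one possible policy (stopping with an error may be more appropriate for real data).

```r
# Hypothetical helper: pick `target` row indices among the rows flagged TRUE
# in `eligible`. If there are too few eligible rows, warn and fall back to
# sampling with replacement instead of failing.
sample_eligible <- function(eligible, target) {
  idx <- which(eligible)
  if (length(idx) == 0) stop("no rows satisfy the conditions")
  if (target > length(idx)) {
    warning("only ", length(idx), " eligible rows for a target of ", target,
            "; sampling with replacement")
    return(idx[sample.int(length(idx), target, replace = TRUE)])
  }
  # Indexing via sample.int() avoids the sample(x) surprise when length(x) == 1
  idx[sample.int(length(idx), target)]
}

# Usage on a small toy data frame (20 rows, 3 of them eligible)
set.seed(1)
toy    <- data.frame(B = rep(c(30, 40), each = 10), C = rep(1:10, 2))
target <- round(0.1 * nrow(toy))
picked <- sample_eligible(toy$B == 30 & toy$C %in% 8:10, target)
toy[picked, ]
```

Note that this is still simple random sampling within the filtered rows; it does not balance the draw across the levels of C.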

edited Sep 27 '20 at 08:41

answered Sep 07 '17 at 13:52


Actually, I was just testing this on sample data. My actual data contains about 10 M records. Correct me if I'm wrong, but sample() just does random sampling, right? What I want is stratified sampling. – Marek Sep 07 '17 at 14:00
@Rowen You are right, `sample` just does random sampling. For stratified sampling, take a look at package `sampling`, function `strata`. – Rui Barradas Sep 07 '17 at 14:03
I did, but I didn't notice any way to provide column-based conditions in that. If you do come across something like that, please be so kind as to let me know. Thanks – Marek Sep 07 '17 at 14:12
