I'd like to divide a sample into two groups so that two or more variables are represented proportionally in both groups. For instance, in the mtcars dataset, here are the proportions for the last three variables in the data.frame:

> data(mtcars)
> round(prop.table(table(mtcars$carb)),2)

   1    2    3    4    6    8 
0.22 0.31 0.09 0.31 0.03 0.03 
> round(prop.table(table(mtcars$gear)),2)

   3    4    5 
0.47 0.38 0.16 
> round(prop.table(table(mtcars$am)),2)

   0    1 
0.59 0.41 

In this example, I'd like to divide the sample into two groups such that each group has something close to the 60/40 split on am, and splits on carb and gear that are similar to their proportions in the full dataset.

The closest thing I know how to do is to draw a matched sample, as in a treatment study. In that case, though, the two groups are already defined by some treatment variable, and you're simply matching a control unit to each treatment unit so that the groups have similar proportions on one or more covariates. This is a little different, and while I feel like there must be a similar method to use, I can't wrap my head around it. Is there an efficient way to do this? Or is there a totally different way I should be thinking about this?
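For reference, the most direct version of this idea is probably a stratified split: treat every observed combination of the three variables as a stratum and divide each stratum as evenly as possible between the two groups. The following is only a minimal sketch under that assumption; `strata`, `assign_groups` and `group` are names made up for illustration, and the seed is arbitrary. With only 32 rows, many strata in mtcars contain a single car, so the resulting balance on the margins can be loose.

```r
data(mtcars)
set.seed(1)  # arbitrary seed, only for reproducibility

# One stratum per observed combination of carb, gear and am
strata <- interaction(mtcars$carb, mtcars$gear, mtcars$am, drop = TRUE)

# Within a stratum, build alternating labels 1, 2, 1, ... (or 2, 1, 2, ...),
# starting at a random label so odd-sized strata do not always favour
# the same group, then shuffle which row receives which label
assign_groups <- function(idx) {
  labels <- rep_len(sample(1:2), length(idx))
  labels[sample.int(length(idx))]
}

group <- integer(nrow(mtcars))
for (idx in split(seq_len(nrow(mtcars)), strata)) {
  group[idx] <- assign_groups(idx)
}

# Row-wise proportions of am within each group
round(prop.table(table(group, mtcars$am), 1), 2)
```

Swapping `mtcars$am` for `mtcars$gear` or `mtcars$carb` in the last line shows the other margins. Splitting on the joint strata becomes impractical as the number of variables grows, which is the limitation behind the question.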

Jon
  • Use the `prob` argument in `sample`. – GKi Apr 20 '23 at 12:47
  • How can I use the `prob` argument in `sample` when there is more than one variable giving proportions that I'm trying to match? – Jon Apr 20 '23 at 15:36
  • E.g. taking those from `proportions(table(mtcars[c("carb", "gear", "am")]))`. – GKi Apr 20 '23 at 19:31
  • Oh I see, so working with the joint probabilities over the multiple variables. I think that would work OK for 2-3 variables, but do you know of a way that can approximate the marginal proportions of the different variables without modeling the joint probabilities? – Jon Apr 21 '23 at 17:20
  • Maybe multiplying them like `outer(proportions(table(mtcars$carb)), proportions(table(mtcars$gear)))` ? – GKi Apr 24 '23 at 07:21
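One way to target the marginal proportions directly, without modelling the joint table, is to draw many random equal-sized splits and keep the one whose groups are closest to the full-sample margins. This is a different technique from the `prob` suggestion in the comments, and the code is only a rough sketch: `vars`, `imbalance`, `candidates` and `best` are hypothetical names, and the number of draws (1000) and the seed are arbitrary choices.

```r
data(mtcars)
set.seed(42)  # arbitrary seed, only for reproducibility

vars <- c("carb", "gear", "am")
n <- nrow(mtcars)

# Largest absolute gap, over all levels of all variables, between a
# group's proportion and the proportion in the full data set
imbalance <- function(group) {
  gaps <- sapply(vars, function(v) {
    f <- factor(mtcars[[v]])
    full <- as.vector(prop.table(table(f)))
    by_group <- prop.table(table(group, f), 1)  # one row per group
    max(abs(sweep(by_group, 2, full)))          # subtract full-sample margins
  })
  max(gaps)
}

# Each column is one random split into two groups of (near-)equal size
candidates <- replicate(1000, sample(rep(1:2, length.out = n)))
scores <- apply(candidates, 2, imbalance)
best <- candidates[, which.min(scores)]

# Inspect the margins of the chosen split
lapply(vars, function(v) round(prop.table(table(best, mtcars[[v]]), 1), 2))
```

The criterion only looks at margins, so it scales to more variables by extending `vars`. The maximum gap could be replaced with another measure, such as a sum of squared gaps, if some levels matter more than others.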

0 Answers