Let's say I have a dataset that looks like this:

set.seed(2016)

d <- data.frame(x=rnorm(1000),
                y=sample(x=c("A", "B", "C"), size=1000, replace=TRUE))

I have some method that selects a subset:

s <- data.frame(x=rnorm(100), 
                y=sample(x=c("A", "A", "B", "C"), size=100, replace=TRUE))

The subset has a different distribution of y:

prop.table(table(d$y))
   A     B     C 
0.335 0.349 0.316  

prop.table(table(s$y))
   A    B    C 
0.44 0.34 0.22

Given the classes, y, for the full data set, d, and the subset, s, how can I draw a sample from d with the same class distribution and size as s?

Preferably, I would like the result as a vector of row indices into d.

Misconstruction
  • You could get pretty close using something like `sample(levels(d$y), 100, replace = TRUE, prob = prop.table(table(d$y)))` and of course let us not forget `sample_n` and `sample_frac` from `dplyr`. – David Arenburg Jan 14 '16 at 13:31
  • Closer to the original distribution than @DavidArenburg approach: `sample(rep(levels(d$y),100*prop.table(table(d$y))))` – nicola Jan 14 '16 at 13:33
  • This returns just a vector of classes, and not a vector of indices in d – Misconstruction Jan 14 '16 at 13:39
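
One possible way to get row indices of d rather than class labels is to sample within each class: count how many rows of each class s contains, then draw that many row positions from the matching rows of d. The following is only a sketch in base R, under the assumption that d has at least as many rows of every class as s does (otherwise `sample.int` stops with an error). The names `counts`, `pool` and `idx` are illustrative and do not come from the question.

```r
set.seed(2016)

# Data as defined in the question
d <- data.frame(x = rnorm(1000),
                y = sample(x = c("A", "B", "C"), size = 1000, replace = TRUE))
s <- data.frame(x = rnorm(100),
                y = sample(x = c("A", "A", "B", "C"), size = 100, replace = TRUE))

# Number of rows per class in the subset s, as a named integer vector
tab <- table(s$y)
counts <- setNames(as.integer(tab), names(tab))

# For each class, draw that many row indices of d without replacement.
# sample.int() on the pool length avoids the sample(x) pitfall that
# occurs when the pool happens to contain a single index.
idx <- unlist(lapply(names(counts), function(cl) {
  pool <- which(d$y == cl)
  pool[sample.int(length(pool), counts[[cl]])]
}), use.names = FALSE)

# The class counts of the drawn rows should match those of s
table(d$y[idx])
table(s$y)
```

Here `idx` is grouped by class; if the order matters, it could be shuffled afterwards with `sample(idx)`. Sampling without replacement within each class means no row of d is selected twice.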
