I have a large dataset, x, that contains replicated values, some of which are duplicated across its variables:
set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))
x_unique <- x[!duplicated(x),]
I need to sample from the instances of each unique row in x a given number of times, so I create a new variable that is simply the concatenation of the variables in each row:
# Another way of seeing x is as a single value - will be useful later
x_code <- do.call(paste0, x)
u_code <- x_code[!duplicated(x)]
We need a repeated sample from x, drawing each unique row a set number of times. These counts are given in the vector s, with one entry per unique row:
s <- rpois(n = nrow(x_unique), lambda = 0.9)
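The question is how to sample individuals from x to reach the quota set by s for each unique row. Here's a long and ugly way that gets the right result:
sel <- integer(0)  # row indices sampled so far
for (i in seq_along(s)) {
  # rows of x that match the i-th unique row
  xs <- which(x_code %in% u_code[i])
  # draw s[i] of them with replacement (none when s[i] is 0)
  sel <- c(sel, xs[sample(length(xs), size = s[i], replace = TRUE)])
}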
x_sampled <- x[sel, ]
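To make "the right result" concrete, here is a small self-contained sketch that rebuilds the objects above and checks that every unique row was drawn exactly as often as s requires. The name drawn is introduced only for this check, and the check assumes that the pasted codes identify rows uniquely, which holds here because every value is a single digit.

```r
# Rebuild the objects from the question so the check runs on its own
set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))
x_code <- do.call(paste0, x)
u_code <- x_code[!duplicated(x)]
s <- rpois(n = length(u_code), lambda = 0.9)

# Loop-based sampling, with sel initialised before the loop
sel <- integer(0)
for (i in seq_along(s)) {
  xs <- which(x_code %in% u_code[i])
  sel <- c(sel, xs[sample(length(xs), size = s[i], replace = TRUE)])
}
x_sampled <- x[sel, ]

# Count the draws per unique code; fixing the factor levels keeps
# codes with a quota of zero in the table, in the same order as s
drawn <- table(factor(x_code[sel], levels = u_code))
stopifnot(all(as.integer(drawn) == s))
```

Any faster alternative should pass the same check, even though the particular rows it picks will differ from run to run.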
This is slow to run and cumbersome to write.
Is there a way to generate the same result (x_sampled in the above) faster and more concisely? Surely there must be a way!