I want to create a sub-sample of a data frame df, depending on the frequency of a given category in one of its columns, e.g. a.
Let's assume we have a data frame like this:
df <- data.frame(a = rep(1:4, c(3, 9, 4, 8)),
                 b = runif(24))
Then I want to get a sub-sample of rows, proportional to the categories in column a, first in a random way:
smpl <- unlist(lapply(1:4, \(x) sample(c(TRUE, FALSE),
                                       size = sum(x == df$a),
                                       replace = TRUE)))
df[smpl, ]
Here, sample() has the intended effect that, on average, half of the records are returned for each category. However, in any specific draw it may be more or fewer (even zero) for a category.
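To illustrate this variability, here is a minimal sketch that repeats the random draw many times and counts the selected rows per category. The helper name draw_counts is hypothetical. It draws one TRUE/FALSE value per row in a single call, which should be equivalent to the per-category draws above because every row is selected independently.

```r
df <- data.frame(a = rep(1:4, c(3, 9, 4, 8)),
                 b = runif(24))

# Hypothetical helper: one independent 50/50 draw per row,
# then the number of selected rows per category
draw_counts <- function(df) {
  keep <- sample(c(TRUE, FALSE), size = nrow(df), replace = TRUE)
  # factor() with fixed levels keeps categories that got zero rows
  table(factor(df$a[keep], levels = sort(unique(df$a))))
}

# One column per repetition, one row per category
counts <- replicate(1000, draw_counts(df))

rowMeans(counts)         # close to half the group sizes: 1.5, 4.5, 2, 4
apply(counts, 1, range)  # smallest and largest count seen per category
```

The row means settle near half of each group size, while the ranges show how far a single draw can deviate.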
I am also looking for a second, "more deterministic" approach, in which the rows are still selected at random but each category returns a fixed number of records: exactly 50% of its N rows if N is even, or N %/% 2 resp. N %/% 2 + 1 records if N is odd. The code should be easily readable.
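For illustration, a minimal base R sketch of what such a fixed-size draw could look like. The helper name sample_half and its col argument are hypothetical, and the \(i) shorthand assumes R >= 4.1. It is one possible approach under these assumptions, not necessarily the best one.

```r
# Hypothetical helper: draws N %/% 2 rows at random from each category
sample_half <- function(df, col = "a") {
  # Row indices grouped by category
  idx <- split(seq_len(nrow(df)), df[[col]])
  # Pick N %/% 2 positions within each group; indexing by position
  # avoids the sample(x, ...) surprise when a group has a single index
  keep <- lapply(idx, \(i) i[sample.int(length(i), length(i) %/% 2)])
  # Return the selected rows in their original order
  df[sort(unlist(keep, use.names = FALSE)), ]
}

df <- data.frame(a = rep(1:4, c(3, 9, 4, 8)),
                 b = runif(24))

half <- sample_half(df)
table(half$a)  # 1, 4, 2 and 4 rows for categories 1 to 4
```

Replacing length(i) %/% 2 with ceiling(length(i) / 2) would round up instead, giving N %/% 2 + 1 records for odd N and leaving even N unchanged.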