Random sampling from a data.table with different draws based on the categorical value in a column

Question

I have a data table with 22 rows, each labeled with one of two sample IDs ('A' or 'B'). I want to reduce the sample size randomly while fixing how many rows are drawn per ID: say, 7 rows from 'A' and 5 rows from 'B', so that my data.table has 12 rows instead of 22. I then want to take the mean of a column I generated. Finally, I want to repeat that 100 times via bootstrapping and see how much the means vary, so I can compute statistics such as the mean and sd of those means.

The background is that I have a small sample set and a bigger one. I want to subsample the bigger set to evaluate the accuracy of the smaller one. I am fairly new to R and appreciate any help. Thanks.

data <- data.table(Sample = c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B','B','B','B'),
                   weight=rnorm(1:22),
                   height=rnorm(1:22))

# I want to randomly draw 7 values from A and 5 values from B, then take the mean of this new df, and repeat that whole step 100 times
# to then take the mean over all 100 replicates

set.seed(4561)

new_df <- data %>%
  group_by(Sample) %>% 
  nest() %>%            
  mutate(n = c(7,5)) %>% 
  mutate(samp = map2(data, n, sample_n)) %>% 
  select(Sample, samp) %>%
  unnest() %>%
  mutate(diff.height.weight = height-weight) %>%
  mutate(means = mean(diff.height.weight))%>%
  bootstraps(means, times=100)

dcarlson · Answer 1 · 2019-07-15T04:02:13.437

I think you are overthinking this. First, though: R is a sprawling expanse of base and contributed packages, and the same function names can and do exist in different packages. You need to tell us which packages you have loaded with library(), or we cannot reproduce your code without a lot of trial and error. If I understand correctly, you want to randomly select values from two samples, combine them, and compute the mean, and you want to do this 100 times. First create the data:

data <- data.frame(Sample = rep(c('A', 'B'), each=11), weight=rnorm(22),
     height=rnorm(22))
data$diff <- data$height - data$weight
str(data)
# 'data.frame': 22 obs. of  4 variables:
#  $ Sample: Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
#  $ weight: num  0.5324 -0.0905 0.1565 -0.7373 -0.2013 ...
#  $ height: num  -0.3654 0.8166 -0.0606 -0.5014 0.9261 ...
#  $ diff  : num  -0.898 0.907 -0.217 0.236 1.127 ...

I'm keeping it simple by just using a data frame. The rnorm() function only needs to know how many values to create; it does not need a vector (rnorm(1:22) works only because a vector argument is interpreted as its length). We also compute the differences just once and store them in the data frame along with the other information.

Now we need to identify which values are A and which are B:

rows <- seq_along(data$diff)
a <- rows[data$Sample=="A"]
b <- rows[data$Sample=="B"]

To draw a sample we just select from the row numbers:

set.seed(42)
smp <- c(sample(a, 7), sample(b, 5))
# smp <- c(sample(a, 7, replace=TRUE), sample(b, 5, replace=TRUE))

The commented-out line draws the sample with replacement, which is typical for bootstrapping, so you may want that instead. Now we compute the mean of the sample:

mn <- mean(data$diff[smp])
mn
# [1] -0.05161422
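
If there are more than two groups, the row-index approach above can be generalized. The following is only a sketch under assumptions: the names demo, rows_by_group and sizes are made up for illustration, and the data are freshly simulated rather than taken from the question. It uses split() to collect the row numbers per group and sample.int() on positions, which avoids the known quirk that sample(x, n) treats a length-one numeric x as 1:x.

```r
# Hypothetical demo data with the same layout as above
set.seed(1)
demo <- data.frame(Sample = rep(c("A", "B"), each = 11),
                   weight = rnorm(22),
                   height = rnorm(22))

# Row numbers for each group, as a named list
rows_by_group <- split(seq_len(nrow(demo)), demo$Sample)

# Number of rows to draw per group (assumed sizes: 7 from A, 5 from B)
sizes <- c(A = 7, B = 5)

# Draw positions within each group, then map them back to row numbers
smp <- unlist(Map(function(idx, n) idx[sample.int(length(idx), n)],
                  rows_by_group[names(sizes)], sizes))

# Check how many rows were drawn from each group
table(demo$Sample[smp])
```

Adding a group then only means adding an entry to sizes; the sampling code itself stays the same.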

Finally we do this 100 times:

mns <- replicate(100, mean(data$diff[c(sample(a, 7), sample(b, 5))]))
# mns <- replicate(100, mean(data$diff[c(sample(a, 7, replace=TRUE),
#          sample(b, 5, replace=TRUE))]))
mean(mns)
# [1] 0.2700163
sd(mns)
# [1] 0.2819093
quantile(mns)
#           0%         25%         50%         75%        100% 
#  -0.41813426  0.09958492  0.26071086  0.45378608  0.94693304
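
Since the question uses a data.table, the same idea can probably be written with grouped row indices as well. This is a minimal sketch, assuming the data.table package is installed and that Sample is a character column; the names dt, sizes and stratified_mean are hypothetical, and the data are simulated here rather than taken from the question.

```r
library(data.table)

set.seed(4561)
dt <- data.table(Sample = rep(c("A", "B"), each = 11),
                 weight = rnorm(22),
                 height = rnorm(22))
dt[, diff := height - weight]   # compute the differences once

# Rows to draw per group (assumed: 7 from A, 5 from B)
sizes <- c(A = 7, B = 5)

# Draw a stratified sample and return the mean of diff.
# .I holds the row numbers of the current group, .N its size,
# and .BY$Sample the group label used to look up the draw size.
stratified_mean <- function(d, sizes, replace = FALSE) {
  idx <- d[, .I[sample.int(.N, sizes[[.BY$Sample]], replace = replace)],
           by = Sample]$V1
  mean(d$diff[idx])
}

# Repeat the draw 100 times and summarize the resulting means
mns <- replicate(100, stratified_mean(dt, sizes))
c(mean = mean(mns), sd = sd(mns))

# Mean over all 22 rows, for comparison
mean(dt$diff)
```

Passing replace = TRUE to stratified_mean() gives the with-replacement variant mentioned earlier.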
