I am using bootstrapping to get 95% CIs for mean calculations for several Evaluation Units (EUs). The calculation without bootstrapping is
EU prevalence = sum(cluster prevalence)/# of clusters
Now, an example of the problem is that there are some EUs that have 25 clusters and some that have 30. Please help me put some code in the loop to automatically use the correct number of clusters in the calculation based on the EU code and linking to another table- see line 10 (I keep thinking of VLOOKUP in Excel).
The "dataset" table is grouped by EU&Cluster and has the cluster-level prevalence values. Here is an example of how it would look:
eu cluster cluster_prev
640 1 0.23
640 2 0.78
...
640 25 0.78
678 1 0.97
...
678 27 1.2
681 1 0
...
681 31 0.78
Then there's a table called "cluster_count" which is grouped by EU and has 2 columns: EU & cluster_ct (number of clusters in the EU)... this is the part I can't figure out how to incorporate. Here is an example of how cluster_count would look:
EU cluster_ct
640 25
678 27
681 31
Here is the code:
#Load, transform data
dataset <- read.csv("ttprev_cluster.csv")
str(dataset)
dataset$eu <- as.factor(dataset$EU)
dataset$cluster <- as.factor(dataset$CLUSTER)
dataset$cluster_prev <- dataset$adj_tt
#Boot statistic function
clustermean <- function(df, i) {
#this is the number that I want to replace with code
num_clusters <- 25
r <- round(runif(num_clusters, 1, nrow(df)))
df2 <- numeric()
for (i in 1:num_clusters)
df2[i] <- df[r[i],]$cluster_prev
return(mean(df2))
}
#create empty data frame for results
bootResult <- data.frame(eu=character(), bootmean=numeric(), se=numeric(), ci95_low=numeric(), ci95_high=numeric(), stringsAsFactors=FALSE)
#Bootstrap function, looped over each EU
library(boot)
num_reps <- 10000
for (i in 1:nlevels(dataset$eu)) {
data2 <- subset(dataset, eu==levels(eu)[i])
b <- boot(data2, clustermean, num_reps)
m <- mean(b$t)
se <- sd(b$t)
#calculate 2.5/97.5 percentiles as Confidence Interval
q <- quantile(b$t, c(0.025, 0.975))
ci_lower <- q[1]
ci_upper <- q[2]
}