0

I am using bootstrapping to get 95% CIs for mean calculations for several Evaluation Units (EUs). The calculation without bootstrapping is

EU prevalence = sum(cluster prevalence)/# of clusters

Now, an example of the problem is that there are some EUs that have 25 clusters and some that have 30. Please help me put some code in the loop to automatically use the correct number of clusters in the calculation based on the EU code and linking to another table- see line 10 (I keep thinking of VLOOKUP in Excel).

The "dataset" table is grouped by EU&Cluster and has the cluster-level prevalence values. Here is an example of how it would look:

eu    cluster    cluster_prev
640   1          0.23
640   2          0.78
...
640   25         0.78
678   1          0.97
...
678   27         1.2
681   1          0
...
681   31         0.78

Then there's a table called "cluster_count" which is grouped by EU and has 2 columns: EU & cluster_ct (number of clusters in the EU)... this is the part I can't figure out how to incorporate. Here is an example of how cluster_count would look:

EU    cluster_ct
640   25
678   27
681   31

Here is the code:

#Load, transform data
dataset <- read.csv("ttprev_cluster.csv") 
str(dataset)
dataset$eu <- as.factor(dataset$EU)
dataset$cluster <- as.factor(dataset$CLUSTER)
dataset$cluster_prev <- dataset$adj_tt

#Boot statistic function 
clustermean <- function(df, i) {

    #this is the number that I want to replace with code
    num_clusters <- 25 

    r <- round(runif(num_clusters, 1, nrow(df)))

    df2 <- numeric()
    for (i in 1:num_clusters) 
        df2[i] <- df[r[i],]$cluster_prev

    return(mean(df2))  
}

#create empty data frame for results
bootResult <- data.frame(eu=character(), bootmean=numeric(), se=numeric(), ci95_low=numeric(), ci95_high=numeric(), stringsAsFactors=FALSE)

#Bootstrap function, looped over each EU
library(boot)
num_reps <- 10000 
for (i in 1:nlevels(dataset$eu)) {
    data2 <- subset(dataset, eu==levels(eu)[i])
    b <- boot(data2, clustermean, num_reps)
    m <- mean(b$t)
    se <- sd(b$t)

    #calculate 2.5/97.5 percentiles as Confidence Interval
    q <- quantile(b$t, c(0.025, 0.975))
    ci_lower <- q[1]
    ci_upper <- q[2]
}
Siguza
  • 21,155
  • 6
  • 52
  • 89
beck777
  • 49
  • 1
  • 7
  • What is the error with the code you have so far? – dg99 Feb 13 '15 at 18:38
  • sorry- I have tried so many ways of doing it that I didn't know which erroneous method to put. The code above works... the problem is that I have to manually change that number (25) every time. – beck777 Feb 13 '15 at 18:44

2 Answers2

0

The preffered method is to take advantage of the ... argument of boot(). as in:

#Boot statistic function
clustermean <- function(df, 
                        i,
                        num_clusters # num_clusters is now an artument to clustermean
                        ) {
    # blah blah blah
}


# blah blah blah

for (i in blahBlahBlah) {

    #calculate num_clusters here
    num_clusters <- cluster_count[cluster_count$EU == levels(eu)[i],
                                  'cluster_ct']

    b <- boot(data2, 
              clustermean, 
              num_reps,

              # additional arguments supplied to `boot()` that
              # don't match the formal arguments to boot 
              # are passed on to the 'statistic' function:

              # (note that you have to name this argument so 
              # it isn't matched positionally)

              num_clusters=num_clusters) # 


    # blah blah blah 

}
Jthorpe
  • 9,756
  • 2
  • 49
  • 64
  • Thank you very much for this. I haven't been able to get it to work yet, but I am having a colleague take a look at it to help me see what I'm missing. I will let everyone know if it works! – beck777 Feb 16 '15 at 22:02
0

a different colleague helped explain the syntax within the clustermean argument to me a bit more and I ended up going with the following (and it works!!):

num_clusters <- nrow(df)
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
beck777
  • 49
  • 1
  • 7