How can I use dplyr to get bootstrap quantile estimates by group?

Question

I am trying to match the rms package function 'smean.cl.boot' bootstrapped confidence intervals using dplyr (method 1). However, I am not able to bootstrap single columns inside the dplyr call. Could someone show me how to re-sample a single column, get the means of each group, and finally estimate quantiles of that column?

Please consider this small dataset. I am using the plyr package to summarize the grouped data before estimating the quantiles but for some reason I am getting different results than the rms package.

require(rms)
require(dplyr)
require(plyr)

fish <- structure(list(wk = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 
2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 
5, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 
8, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10), pd = c(317.308439683869, 
0, 126.719553152898, NA, NA, NA, NA, 2671.6, 3540.6976744186, 
1270.35740604274, 1067.69362430466, 688.099646524154, 317.444499806234, 
420.941879550524, 280.475476696762, 250.681324772507, 159.048160622895, 
258.125109208457, 450.868907331836, 0, 120.83949704142, 244.794377928162, 
0, 226.610029158717, 0, NA, NA, NA, 0, 0, 776.419523429887, 0, 
0, 5572.7956254272, NA, 0, 235.711495898971, 0, 0, 0, 0, 0, 0, 
158.796322731685, 0, 0, 0, 278.669954021457, 0, 0, 0, 0, 0, 0, 
0, NA, 623.451776649746, 0, 440.704258124564, 0, 69.0758191406588, 
0, 0, 51.2873010185801, 26.8224496254879, 104.366153205662, 0, 
71.1744651415584, 0, 0)), .Names = c("wk", "pd"), row.names = c(NA, 
70L), class = "data.frame")
fish

# Method 1
fish <- na.omit(fish)
x <- data.frame(boot=1:1000) %>%
  group_by(boot) %>% 
  do(sample_n(fish, nrow(fish), replace=TRUE)) %>%
  group_by(boot,wk)
  plyr::ddply(x,'wk',summarise,Mean=mean(pd),lower=quantile(pd,prob=0.025),upper=quantile(pd,prob=0.975))

    wk       Mean      Lower      Upper
1   1  148.00933   0.000000  317.30844
2   2 1425.26210 643.274777 2322.42315
3   3  217.14835 125.537283  304.37517
4   4  117.85110   0.000000  235.70220
5   5 1058.20252   0.000000 2915.80107
6   6   33.67307   0.000000  101.01921
7   7   62.49518   0.000000  142.11517
8   8    0.00000   0.000000    0.00000
9   9  161.89026   9.867974  356.25816
10 10   36.23577  11.133767   66.07178

 #Method 2 
boots <- fish %>%
  group_by(wk) %>%
  do(data.frame(rbind(smean.cl.boot(.$pd))))
  data.frame(boots)
   wk       Mean    lower     upper
1   1  145.71624   0.0000  317.3084
2   2 1490.79383 317.4445 3540.6977
3   3  215.44592   0.0000  450.8689
4   4  124.15618   0.0000  244.7944
5   5  976.88218   0.0000 5572.7956
6   6   27.88334   0.0000  235.7115
7   7   52.79893   0.0000  278.6700
8   8    0.00000   0.0000    0.0000
9   9  165.98724   0.0000  623.4518
10 10   35.66937   0.0000  104.3662

Am I missing a step on method 1?

The first way, I think you are bootstrap sampling your entire dataset and calculating CIs for each week. The second way you are bootstrap sampling each week and calculating CIs. It makes sense that the first one has higher variance because some of the samples will have very few observations for some weeks, where the second will will always have the same number of observations bootstrapped for each week. — Gregor Thomas, Oct 15 '17 at 19:31
Btw, you can change your last line of #1 to `dplyr::summarise(Mean = mean(pd), lower = quantile(pd, prob = 0.025), upper = quantile(pd, prob=0.975))` to get the same results without needing the `plyr` dependency. It will probably be quite a bit faster (though still using the high-variance method). — Gregor Thomas, Oct 15 '17 at 19:33
I understand the high variance used in the first method. I wonder if there is a better way to match the rms results. Thanks for pointing out dplyr. — Salvador, Oct 15 '17 at 19:38
Gregor, Yes. I want method 1 to match method 2. I want to be able to re-sample only one column(pd) instead of re-sampling the entire dataset. Is it possible to re-sample only one column inside the dplyr call? — Salvador, Oct 16 '17 at 17:19
Just cleaned up some of the comments so they're not distracting. Good edits! — Gregor Thomas, Oct 16 '17 at 17:59

How can I use dplyr to get bootstrap quantile estimates by group?

0 Answers0