How to make grouped summary statistics based off of densities in R

Question

Goal: I would like to generate grouped percentiles for each group (hrzn)

I have the following data

# A tibble: 3,500 x 3
    hrzn parameter density
   <dbl>     <dbl>   <dbl>
 1     1    0.0183 0.00914
 2     1    0.0185 0.00905
 3     1    0.0187 0.00897
 4     1    0.0189 0.00888
 5     1    0.0191 0.00880
 6     1    0.0193 0.00872
 7     1    0.0194 0.00864
 8     1    0.0196 0.00855
 9     1    0.0198 0.00847
10     1    0.0200 0.00839

The hrzn is the group, the parameter is a grid of parameter space, and the density is the density for the value in the parameter column.

I would like to generate summary the statistics percentiles 10 to 90 by 10 by hrzn. I am trying to keep this computationally efficient. I know I could sample the parameter with the density as weights, but I am curious is there is a faster way to generate the percentiles from the density without doing a sample.

The data may be obtained with the following

df <- readr::read_csv("https://raw.githubusercontent.com/alexhallam/density_data/master/data.csv")

Allan Cameron · Accepted Answer · 2020-01-25T16:01:59.743

When I load the data from your csv, each of the 5 groups have identical values for parameter and density:

df
#># A tibble: 3,500 x 3
#>    hrzn parameter density
#>   <int>     <dbl>   <dbl>
#> 1     1    0.0183 0.00914
#> 2     1    0.0185 0.00905
#> 3     1    0.0187 0.00897
#> 4     1    0.0189 0.00888
#> 5     1    0.0191 0.00880
#> 6     1    0.0193 0.00872
#> 7     1    0.0194 0.00864
#> 8     1    0.0196 0.00855
#> 9     1    0.0198 0.00847
#>10     1    0.0200 0.00839
#># ... with 3,490 more rows

sapply(1:5, function(x) all(df$parameter[df$hrzn == x] == df$parameter[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE

sapply(1:5, function(x) all(df$density[df$hrzn == x] == df$density[df$hrzn == 1]))
# [1] TRUE TRUE TRUE TRUE TRUE

I'm not sure if this is a mistake or not, but clearly if you're worried about computation, anything you want to do on all the groups can be done 5 times faster by only doing it on a single group.

Anyway, to get the 10th and 90th centiles for each hrzn, you just need to see which parameter is adjacent to 0.1 and 0.9 on the cumulative distribution function. Let's generalize to working it out for all the groups in case there's an issue with the data or you want to repeat it with different data:

library(dplyr)

df %>% 
  mutate(hrzn = factor(hrzn)) %>%
  group_by(hrzn) %>% 
  summarise(centile_10 = parameter[which(cumsum(density) > .1)[1]],
            centile_90 = parameter[which(cumsum(density) > .9)[1]] )

#># A tibble: 5 x 3
#>  hrzn  centile_10 centile_90
#>  <fct>      <dbl>      <dbl>
#>1 1         0.0204      0.200
#>2 2         0.0204      0.200
#>3 3         0.0204      0.200
#>4 4         0.0204      0.200
#>5 5         0.0204      0.200

Of course, they're all the same for the reasons mentioned above.

If you're worried about computation time (even though the above only takes a few milliseconds), and you don't mind opaque code, you could take advantage of the ordering to cut the cumsum of your entire density column between 0 and 5 in steps of 0.1, to get all the 10th centiles, like this:

summary <- df[which((diff(as.numeric(cut(cumsum(df$density), seq(0,5,.1))) - 1) != 0)) + 1,]
summary <- summary[-(1:5)*10,]
summary$centile <- rep(1:9*10, 5)
summary
#> # A tibble: 45 x 4
#>     hrzn parameter density centile
#>    <int>     <dbl>   <dbl>   <dbl>
#>  1     1    0.0204 0.00824      10
#>  2     1    0.0233 0.00729      20
#>  3     1    0.0271 0.00634      30
#>  4     1    0.0321 0.00542      40
#>  5     1    0.0392 0.00453      50
#>  6     1    0.0498 0.00366      60
#>  7     1    0.0679 0.00281      70
#>  8     1    0.103  0.00199      80
#>  9     1    0.200  0.00114      90
#> 10     2    0.0204 0.00824      10
#> # ... with 35 more rows

Perhaps I have misunderstood you and you are actually working in a 5-dimensional parameter space and want to know the parameter values at the 10th and 90th centiles of 5d density. In that case, you can take advantage of the fact that all groups are the same to calculate the 10th and 90th centiles for the 5-d density by simply taking the 5th root of these two centiles:

df %>% 
  mutate(hrzn = factor(hrzn)) %>%
  group_by(hrzn) %>% 
  summarise(centile_10 = parameter[which(cumsum(density) > .1^.2)[1]],
            centile_90 = parameter[which(cumsum(density) > .9^.2)[1]] )

#> # A tibble: 5 x 3
#>   hrzn  centile_10 centile_90
#>   <fct>      <dbl>      <dbl>
#> 1 1         0.0545      0.664
#> 2 2         0.0545      0.664
#> 3 3         0.0545      0.664
#> 4 4         0.0545      0.664
#> 5 5         0.0545      0.664

Thanks. Looks like you are correct about the densities for the parameters being the same. Other seeds I used to generate the data had slightly different results. Anyways, this is exactly what I was looking for. — Alex, Jan 25 '20 at 16:35

How to make grouped summary statistics based off of densities in R

1 Answers1