Efficiently create summary table with min, max, mean, SE and sample size by category

Question

I have a dataset that shows the walking speed of different age classes, measured in two different periods. Here is a reproducible example:

Period<-rep(c(1,2),times=c(9,18))
Class<-rep(c("child","teen","adult","toddler","child","teen","adult"),times=c(2,3,4,3,4,5,6))
Speed<-c(2,3,3,4,5,5,4,4,6,1,0.7,0.3,1,2,3,2,2,5,5,4,2,3,5,4,6,5,4)
data<-data.frame(Period,Class,Speed)

   Period   Class Speed
1       1   child   2.0
2       1   child   3.0
3       1    teen   3.0
4       1    teen   4.0
5       1    teen   5.0
6       1   adult   5.0
7       1   adult   4.0
8       1   adult   4.0
9       1   adult   6.0
10      2 toddler   1.0
11      2 toddler   0.7
12      2 toddler   0.3
13      2   child   1.0
14      2   child   2.0
15      2   child   3.0
16      2   child   2.0
17      2    teen   2.0
18      2    teen   5.0
19      2    teen   5.0
20      2    teen   4.0
21      2    teen   2.0
22      2   adult   3.0
23      2   adult   5.0
24      2   adult   4.0
25      2   adult   6.0
26      2   adult   5.0
27      2   adult   4.0

I want to create a summary table with the following attributes:

The minimum, maximum and average speed by age class and by period
The standard error of the speed by age class by period
The number of individuals per class by period

I would also like to add three "total" rows:

minimum, maximum, average speeds, standard errors and number of all individuals in Period 1
The same for period 2
The same over both periods

The resulting summary table should look like this:

 Class         Period  mean   min   max    SE     n
   <chr>          <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 adult              1 4.75    4       6 0.479     4
 2 adult              2 4.5     3       6 0.428     2
 3 child              1 2.5     2       3 0.5       3
 4 child              2 2       1       3 0.408     6
 5 teen               1 4       3       5 0.577     4
 6 teen               2 3.6     2       5 0.678     5
 7 toddler            2 0.667   0.3     1 0.203     3
 8 Total.Period2      2 4       2       6 0.408     9
 9 Total.Period1      1 3.06    0.3     6 0.414    18
10 Total             NA 3.37    0.3     6 0.315    27

I can currently do this using functions from the tidyverse and rbind() in R. I've also been using the function std.error from the package plotrix() to calculate the standard errors. However, this is quite a lengthy process and I'd like to make it more efficient. Any suggestions?

I think tidyverse and rbind() (I assume using filter for each Total.Period row) is pretty efficient given how many different calculations and groups you require. Putting it in its own custom function would neaten your code if that's a concern. — Bowhaven, Feb 21 '23 at 17:19
I have already done the code, just wait some minutes so Stack Overflow let me post it hahah — Lucas, Feb 21 '23 at 17:35

score 2 · Answer 1 · answered Feb 21 '23 at 19:23

Using data.table and self defined standard error function std_err

std_err <- function(x, n) sd(x) / sqrt(n)

library(data.table)

setDT(data)

dd1 <- data[, .(mean = mean(Speed), min = min(Speed), max = max(Speed), 
                se = std_err(Speed, .N), n = .N), by=.(Class, Period)]

dd2 <- dd1[, .(Class = paste0("Total.Period_", Period), mean = mean(mean), 
               min = min(min), max = max(max), se = mean(se), 
               n = sum(n)), by=Period]

dd3 <- dd2[, .(Class = "Total", Period = list(Period), mean = mean(mean), 
               min = min(min), max = max(max), se = mean(se), n = sum(n))]

rbindlist(list(dd1, dd2, dd3), use.names=T)
             Class Period      mean min max        se  n
 1:          child      1 2.5000000 2.0   3 0.5000000  2
 2:           teen      1 4.0000000 3.0   5 0.5773503  3
 3:          adult      1 4.7500000 4.0   6 0.4787136  4
 4:        toddler      2 0.6666667 0.3   1 0.2027588  3
 5:          child      2 2.0000000 1.0   3 0.4082483  4
 6:           teen      2 3.6000000 2.0   5 0.6782330  5
 7:          adult      2 4.5000000 3.0   6 0.4281744  6
 8: Total.Period_1      1 3.7500000 2.0   6 0.5186879  9
 9: Total.Period_2      2 2.6916667 0.3   6 0.4293536 18
10:          Total    1,2 3.2208333 0.3   6 0.4740208 27

score 1 · Accepted Answer · answered Feb 21 '23 at 17:42

Try this to calculate the summary:

library(tidyverse)
rbind(data %>% group_by(Class, Period) %>% summarise(mean = mean(Speed), min = min(Speed), max = max(Speed), SE = sd(Speed), n = n()) %>% ungroup(),
      cbind("Class" = c("Period.1","Period.2"),data %>% group_by(Period) %>% summarise(mean = mean(Speed), min = min(Speed), max = max(Speed), SE = sd(Speed), n = n()) %>% ungroup()),
      cbind("Class" = "Total", "Period" = NA,data %>% summarise(mean = mean(Speed), min = min(Speed), max = max(Speed), SE = sd(Speed), n = n()) %>% ungroup()))

As far as I know, this is the fastest way to do it

Efficiently create summary table with min, max, mean, SE and sample size by category

2 Answers2