Aggregate by customized row breaks

Question

I would take to take the mean of the columns by a certain break of the rows. For instance:

set.seed(0)
dt = data.frame(cbind(rnorm(10, 0, 1), rnorm(10, 0, 2), rnorm(10, 0, 3)))
breaks = c(0,1,2,4,8,Inf)

The only solution I can think of is manually plug in row index then use colMeans or use loop, which is painful since I have a much longer break rule. My expected results are as following:

re = rbind(colMeans(dt[1, ]), colMeans(dt[2, ]), colMeans(dt[3:4, ]), 
      colMeans(dt[5:8, ]), colMeans(dt[9:10, ]))

Any advice (or direct to a duplicated answer) is appreciated!

If it is a custom break, you may have to come up with the index manually. If there is any pattern, it can be programmatically created — akrun, Jul 12 '19 at 14:10
Is the last bin supposed to be left-closed? or is it a typo? `colMeans(dt[8:10, ]` or `colMeans(dt[9:10, ]`? — acylam, Jul 12 '19 at 14:29
@avid_useR It was a typo - sorry about that! I fixed it for further reference. — Rachel Zhang, Jul 13 '19 at 01:34

akrun · Answer 1 · 2019-07-12T14:18:54.750

We can split the dataset into a list of data.frame using group created with rep, loop through the list with sapply, get the colMeans

re1 <- t(sapply(split(dt, rep(1:5, c(1, 1, 2, 3, 3))), colMeans))

If the rows are overlapping

library(tidyverse)
dt %>% 
   mutate(n = case_when(row_number() == 8 ~ 2, TRUE ~ 1)) %>%
   uncount(n) %>% 
   group_by(grp = rep(1:5, c(1, 1, 2, 4, 3))) %>% 
   summarise_all(mean) %>%
   ungroup %>%
   select(-grp)
# A tibble: 5 x 3
#      X1     X2     X3
#   <dbl>  <dbl>  <dbl>
#1  1.26   1.53  -0.673
#2 -0.326 -1.60   1.13 
#3  1.30  -1.44   1.41 
#4 -0.587 -0.675  0.631
#5  0.701 -1.13  -1.93

-checking OP's output

re
#             X1         X2         X3
#[1,]  1.2629543  1.5271869 -0.6728037
#[2,] -0.3262334 -1.5980185  1.1321869
#[3,]  1.3011143 -1.4371186  1.4062888
#[4,] -0.5871490 -0.6752118  0.6309875
#[5,]  0.7013886 -1.1291842 -1.9288270

acylam · Accepted Answer · 2019-07-13T02:16:04.300

We can easily do this with group_by and summarize in dplyr. For the grouping variable, we can use the Base R function cut, which takes a vector of breaks (perfect for your case) and bins the row_number (this is a helper function from dplyr):

library(dplyr)

dt %>%
  group_by(grp = cut(row_number(), breaks)) %>%
  summarize_all(mean) %>%
  select(-grp)

Output:

# A tibble: 5 x 3
      X1     X2     X3
   <dbl>  <dbl>  <dbl>
1  1.26   1.53  -0.673
2 -0.326 -1.60   1.13 
3  1.30  -1.44   1.41 
4 -0.587 -0.675  0.631
5  1.20  -0.802 -1.86

or with Base R:

dt$grp <- cut(1:nrow(dt), breaks)
aggregate(cbind(X1, X2, X3) ~ grp, data = dt, FUN = mean)

Output:

      grp         X1         X2         X3
1   (0,1]  1.2629543  1.5271869 -0.6728037
2   (1,2] -0.3262334 -1.5980185  1.1321869
3   (2,4]  1.3011143 -1.4371186  1.4062888
4   (4,8] -0.5871490 -0.6752118  0.6309875
5 (8,Inf]  1.1994431 -0.8018551 -1.8568098

Note that the output might seems that the numbers are rounded, but it is only for display. The underlying data keeps the original precision.

@RachelZhang also check out the base R equivalent if you don't want to install any packages — acylam, Jul 13 '19 at 02:16

Aggregate by customized row breaks

2 Answers2