simplify multiple one to one paired aggregation

Question

I'd like to calculate mean for multiple one to one paired aggregation. For example, I have a data.frame below. I'd like to calculate the mean for for column b1 by sym & a1, and b2 by sym & a2 simultaneously.

   sym a1 a2 b1 b2
1   a  1  2  1  1
2   a  2  2  2  2
3   a  1  2  3  3
4   a  2  2  4  4
5   b  1  1  5  5
6   b  2  1  6  6
7   b  1  1  7  7
8   b  2  1  8  8

Here is my code which uses lapply to iterate over each pair. Is there any more efficient way than this?

df <- data.frame(sym=c(rep('a', 4), rep('b', 4)), a1=rep(1:2, 4), 
                 a2=rep(2:1, each=4), b1=rep(1:8), b2=rep(1:8))

tmp <- ddply(df, "sym", function(x) {

  temp.ls <- lapply(1:2, function(i) {
    t2 <- aggregate(x = x[3+i], by=x[1+i], FUN=function(.){mean(., na.rm = T)})
    colnames(t2) <- c("a", "b")
    t2
  })
  temp.all <- Reduce(function(x, y) merge(x, y, by=c("a"), all=T, sort=T), 
                     temp.ls)
})

jbaums · Accepted Answer · 2014-10-28T17:08:36.703

dplyr makes this pretty straightforward:

library(dplyr)
inner_join(df %>% group_by(sym, a1) %>% summarise(b1.mean=mean(b1)),
           df %>% group_by(sym, a2) %>% summarise(b2.mean=mean(b2)))

# Joining by: "sym"
# Source: local data frame [4 x 5]
# Groups: sym
# 
#   sym a1 b1.mean a2 b2.mean
# 1   a  1       2  2     2.5
# 2   a  2       3  2     2.5
# 3   b  1       6  1     6.5
# 4   b  2       7  1     6.5

If you want a single column for a, and want to fill non-appearing combinations with NA as in your example solution, then left_join is an option:

left_join(df %>% group_by(sym, a=a1) %>% summarise(b1.mean=mean(b1)),
          df %>% group_by(sym, a=a2) %>% summarise(b2.mean=mean(b2)),
          by=c('sym', 'a'))

# Source: local data frame [4 x 4]
# Groups: sym
# 
#   sym a b1.mean b2.mean
# 1   a 1       2      NA
# 2   a 2       3     2.5
# 3   b 1       6     6.5
# 4   b 2       7      NA

Hat-tip to @beginnerR for reminding me about dplyr join operations.

EDIT

In response to the comments, if you have more than two groupings, and want to join all the resulting tables together, then here's one way to do this:

# Example data
set.seed(1)
(d <- data.frame(sym=sample(letters[1:4], 10, replace=T),
           a1=sample(5, 10, replace=TRUE),
           a2=sample(5, 10, replace=TRUE),
           a3=sample(5, 10, replace=TRUE),
           b1=runif(10), b2=runif(10), b3=runif(10)))

#    sym a1 a2 a3        b1         b2         b3
# 1    b  2  5  3 0.8209463 0.47761962 0.91287592
# 2    b  1  2  3 0.6470602 0.86120948 0.29360337
# 3    c  4  4  3 0.7829328 0.43809711 0.45906573
# 4    d  2  1  1 0.5530363 0.24479728 0.33239467
# 5    a  4  2  5 0.5297196 0.07067905 0.65087047
# 6    d  3  2  4 0.7893562 0.09946616 0.25801678
# 7    d  4  1  4 0.0233312 0.31627171 0.47854525
# 8    c  5  2  1 0.4772301 0.51863426 0.76631067
# 9    c  2  5  4 0.7323137 0.66200508 0.08424691
# 10   a  4  2  3 0.6927316 0.40683019 0.87532133

L <- mapply(function(x, y) {
    grpd <- eval(substitute(group_by(d, sym, a=x), list(x=as.name(x))))
    eval(substitute(summarise(grpd, mean(y)), list(y=as.name(y))))
}, paste0('a', 1:3), paste0('b', 1:3), SIMPLIFY=FALSE)

Reduce(function(...) left_join(..., all=T), L)

# Source: local data frame [9 x 5]
# Groups: sym
# 
#   sym a  mean(b1)   mean(b2)   mean(b3)
# 1   a 4 0.6112256         NA         NA
# 2   b 1 0.6470602         NA         NA
# 3   b 2 0.8209463 0.86120948         NA
# 4   c 2 0.7323137 0.51863426         NA
# 5   c 4 0.7829328 0.43809711 0.08424691
# 6   c 5 0.4772301 0.66200508         NA
# 7   d 2 0.5530363 0.09946616         NA
# 8   d 3 0.7893562         NA         NA
# 9   d 4 0.0233312         NA 0.36828101

Good point @beginneR - I've incorporated that into the answer. I noticed that you suggested basically the same solution in the comments 29 sec before I posted - sorry! — jbaums, Oct 28 '14 at 15:56
I appreciate the solution which looks excellent. Is it possible to dynamic reference column in group_by, like a[i]. Also I'd like to left_join above 10 results. Is it possible to do that? — YYY, Oct 28 '14 at 16:24
@YYY - You can do something along [these lines](https://groups.google.com/forum/#!topic/manipulatr/cr9PzNEtz6w), such as: `x <- 'a1'; eval(substitute(group_by(df, x), list(x=as.name(x))))`. Not sure what you mean by "above 10 results". — jbaums, Oct 28 '14 at 16:31
@jbaums Thank you again for the response. I mean I have a1-a20 and b1-b20. Will there be several left_join to combine the 20 result them together? — YYY, Oct 28 '14 at 16:34

simplify multiple one to one paired aggregation

1 Answers1