0

I want to aggregate some columns of a data frame using a factor (group in the example) but I want to use only the rows with the highest values in a different column (time in the example)

df=data.frame(group=c(rep('a',5),rep('b',5)),
           time=c(1:5,2:6),
           V1=c(1,1,1,2,2,1,1,1,1,1),
           V2=c(2,2,1,1,1,1,1,1,1,5))

I know how to do it using ddply but it's pretty slow

ddply(df,'group',summarize,
      V1=sum(V1[order(time,decreasing = T)[1:2]]),
      V2=sum(V2[order(time,decreasing = T)[1:2]]))

"group" "V1"    "V2"
"a" 4   2
"b" 2   6

Is there a faster way to do it (aggregate or data.table)?

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
nex
  • 1
  • 1

1 Answers1

0

We can arrange the data by time, group_by time and sum top 2 values using tail.

This can be done using dplyr :

library(dplyr)

df %>%
  arrange(group, time) %>%
  group_by(group) %>%
  summarise_at(vars(V1:V2), ~sum(tail(., 2)))

#  group    V1    V2
#  <fct> <dbl> <dbl>
#1 a         4     2
#2 b         2     6

and in data.table as :

library(data.table)
setDT(df)[order(group, time), lapply(.SD, function(x) sum(tail(x, 2))), 
          .SDcols = c('V1', 'V2'), group]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213