7

OK, here is the problem that I would love to solve with an efficient, elegant approach such as data.table or dplyr.

Define:

library(data.table)

DT = data.table(time=c(1:3,1:5), group=c(rep("A",3),rep("B",5)), value=c(2,9,2,3,4,1,0,3))

   time group value  
1:    1     A     2  
2:    2     A     9 
3:    3     A     2 
4:    1     B     3     
5:    2     B     4 
6:    3     B     1 
7:    4     B     0 
8:    5     B     3 

What I'm trying to get is, within each group, a cumulative sum of the values in which each value is divided by its reverse rank at the time of observation: the most recent value by 1, the one before it by 2, and so on.

   time group value    RESULT
1:    1     A     2  2.000000
2:    2     A     9 10.000000
3:    3     A     2  7.166667
4:    1     B     3  3.000000
5:    2     B     4  5.500000
6:    3     B     1  4.000000
7:    4     B     0  2.583333
8:    5     B     3  4.933333

At line 5 the result is 4/1 + 3/2 = 5.5, because at time 2 group B has 2 observations: the latest is divided by 1 and the previous one by 2. Next, at line 6 the result is 1/1 + 4/2 + 3/3 = 4, since at time 3 group B has 3 observations: the latest is divided by 1, the previous by 2, and the one before that by 3. At line 7, 0/1 + 1/2 + 4/3 + 3/4 = 2.583333, and so on.
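In general, for row t within a group, RESULT[t] = value[t]/1 + value[t-1]/2 + ... + value[1]/t. For example, line 7 checked in R:

sum(c(3, 4, 1, 0) / 4:1)   # 2.583333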

The data is big, so avoiding loops is essential!


3 Answers

6

I'd use matrix algebra:

# size of the largest group determines the matrix dimension
n_max = DT[, .N, by=group][, max(N)]

# upper-triangular weight matrix: m[i, j] = 1/(j - i + 1) when j >= i, 0 otherwise
m     = matrix(0, n_max, n_max)
m[]   = ifelse( col(m) >= row(m), 1 / (col(m) - row(m) + 1 ), m)

# per group, the row vector of values times m gives, at position j, sum(value[1:j] / (j:1))
DT[, res := value %*% m[seq_len(.N), seq_len(.N)], by=group ]

   group value       res
1:     A     2  2.000000
2:     A     9 10.000000
3:     A     2  7.166667
4:     B     3  3.000000
5:     B     4  5.500000
6:     B     1  4.000000
7:     B     0  2.583333
8:     B     3  4.933333
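
For intuition, here is what m looks like for a 3-row group, together with the product for group A's values (a small illustration of the weights; m3 is just a 3x3 version of m above):

m3 <- matrix(0, 3, 3)
m3[] <- ifelse(col(m3) >= row(m3), 1 / (col(m3) - row(m3) + 1), m3)
m3
#      [,1] [,2]      [,3]
# [1,]    1  0.5 0.3333333
# [2,]    0  1.0 0.5000000
# [3,]    0  0.0 1.0000000

c(2, 9, 2) %*% m3   # 2  10  7.166667 -- the res values for group A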
Frank
3

You can *apply over a sequence of the group length, using that sequence to index value and its reverse to divide by. With dplyr:

library(tidyverse)

DT %>% group_by(group) %>% 
    mutate(result = sapply(seq(n()), function(x){sum(value[seq(x)] / rev(seq(x)))}))

## Source: local data frame [8 x 3]
## Groups: group [2]
## 
##    group value    result
##   <fctr> <dbl>     <dbl>
## 1      A     2  2.000000
## 2      A     9 10.000000
## 3      A     2  7.166667
## 4      B     3  3.000000
## 5      B     4  5.500000
## 6      B     1  4.000000
## 7      B     0  2.583333
## 8      B     3  4.933333

or using purrr::map_dbl instead of sapply,

DT %>% group_by(group) %>% 
    mutate(result = map_dbl(seq(n()), ~sum(value[seq(.x)] / rev(seq(.x)))))

which returns the same thing. You can translate the same logic to base R, as well:

DT$result <- ave(DT$value, 
                 DT$group, 
                 FUN = function(v){sapply(seq_along(v), 
                                          function(x){sum(v[seq(x)] / rev(seq(x)))})})

DT

##   group value    result
## 1     A     2  2.000000
## 2     A     9 10.000000
## 3     A     2  7.166667
## 4     B     3  3.000000
## 5     B     4  5.500000
## 6     B     1  4.000000
## 7     B     0  2.583333
## 8     B     3  4.933333

While I haven't benchmarked, these methods should be fast enough for most jobs. I suspect @Frank's answer is probably faster if speed is paramount, though.
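
If speed really matters, one way to compare the approaches is with the microbenchmark package. A rough sketch (it assumes DT, the matrix m from @Frank's answer, and the packages loaded above):

library(microbenchmark)

microbenchmark(
    matrix_algebra = DT[, res := value %*% m[seq_len(.N), seq_len(.N)], by=group],
    dplyr_map      = DT %>% group_by(group) %>%
        mutate(result = map_dbl(seq(n()), ~sum(value[seq(.x)] / rev(seq(.x))))),
    base_ave       = ave(DT$value, DT$group,
                         FUN = function(v) sapply(seq_along(v),
                                                  function(x) sum(v[seq(x)] / rev(seq(x))))),
    times = 100
)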

alistaire
2

If you have memory to spare, you can use a cartesian join to preallocate the rows, so that the operations done within each by are simpler and can take advantage of data.table's GForce optimization. This may or may not be faster than the other solutions, because it essentially trades memory for more optimized code within the by.

DT[, .SD
   ][DT, on='group', allow.cartesian=T          # self-join on group: pair every row with every row of its group
   ][, setnames(.SD, 'i.time', 'groupRow')      # i.time is the "as of" time for each expanded block
   ][time <= groupRow                           # keep only observations up to that time
   ][, timeRev := .N:1, .(group, groupRow)      # reverse rank within each (group, as-of-time) block
   ][, res := value / timeRev
   ][, .(res=sum(res)), .(group, groupRow, i.value)   # grouped sum, eligible for GForce
   ][, groupRow := NULL
   ][, setnames(.SD, 'i.value', 'value')
   ]
   group value    res
1:     A     2  2.000
2:     A     9 10.000
3:     A     2  7.167
4:     B     3  3.000
5:     B     4  5.500
6:     B     1  4.000
7:     B     0  2.583
8:     B     3  4.933
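
To check that the grouped sum is indeed handled by GForce, the sum step can be run with verbose output (a sketch; DT2 here is a hypothetical name for the intermediate table right before the sum(res) step):

DT2[, .(res = sum(res)), .(group, groupRow, i.value), verbose = TRUE]
# the verbose trace should report that GForce optimized the grouped sum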
Clayton Stanley